Building Discrete-time Survival Prediction Models: autoSurv

Krithika Suresh, Cameron Severn

Dec 20, 2021

library(devtools)
install_github("ksuresh17/autoSurv")
library(autoSurv)
library(ggplot2)

Introduction

Here, we describe the autoSurv() function, developed to train and evaluate the performance of survival prediction models applied to time-to-event data. Currently, the function implements the continuous-time models Cox proportional hazards and random survival forests, and various classification models (logistic regression, elastic net, support vector machines, gradient boosting machines, neural networks) in a discrete-time framework. This function performs the following steps:

- splits the data into training and test sets
- converts the training data into a discrete-time format for the classification models
- tunes model hyperparameters, including the number of discrete-time intervals, using Bayesian optimization (with cross-validation or an internal validation split)
- fits the specified continuous-time and discrete-time models
- evaluates predictive performance in the test set using time-dependent metrics that account for right-censoring

The resulting outputs from this function include:

- the tuned hyperparameter values for each model ($tuned_params)
- predictive performance metrics for the fitted models (e.g., $CVmetrics, $metrics)
- the fitted model objects, which can be used to obtain predictions for new data with predict_autoSurv()

autoSurv() function

The autoSurv() function takes the following parameters (those demonstrated in this vignette are listed here):

- timeVar: name of the survival time variable
- statusVar: name of the event indicator variable (1 = event, 0 = censored)
- data: a complete-case data set with one row per individual
- times: prediction time(s) of interest
- trainModels: the models to fit (see the list of available models below)
- bins.upper: maximum number of intervals considered for the discrete-time data set
- cens: how censored observations are assigned to intervals ("half" or "same")
- testProp: proportion of the data held out from training
- cv, cvFold: whether to tune hyperparameters by cross-validation in the training set, and the number of folds
- init, iter: settings for the Bayesian optimization search scheme
- seed: random seed for reproducibility
- verbose.opt: whether to print the progress of the Bayesian optimization

Example 1 - PBC

In this example, we demonstrate the performance of discrete-time survival models using the pbc data set from the survival package. We begin by formatting the data set for use in the analysis. The current implementation requires a complete-case data set, so observations with missing values are excluded. Alternatively, the user can impute the missing values to create a complete data set (a sketch follows the prepared data below). Each row should correspond to a unique individual.

data(pbc, package="survival")
pbc$statusComp <- ifelse(pbc$status == 2, 1, 0)
pbc$survTime <- pbc$time/365.25
pbc.dat <- pbc[,c(4:22)] #keep the 17 predictors plus survTime and statusComp
pbc.dat <- pbc.dat[complete.cases(pbc.dat),]

#Standardize covariates
pbc.dat_covs <- subset(pbc.dat, select = -c(statusComp,survTime))
pbc.dat_covs_process <- caret::preProcess(pbc.dat_covs, method=c("center", "scale"))
pbc.dat_covs <- predict(pbc.dat_covs_process, pbc.dat_covs)
pbc.dat <- cbind(pbc.dat[,c("survTime","statusComp")], pbc.dat_covs)

The resulting data set has 276 individuals, with a median follow-up time of 6.7 years and a censoring rate of 60%.
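
As an alternative to the complete-case filtering above, missing values could be imputed before modeling. Below is a minimal sketch using caret's median imputation (an illustration only, not part of autoSurv; in practice, a more principled approach such as multiple imputation with the mice package may be preferable):

#Alternative to complete-case analysis: impute missing covariate values
#(median imputation for illustration; only numeric columns are imputed)
pbc.imp <- pbc[,c(4:22)]
pbc.imp_process <- caret::preProcess(pbc.imp, method="medianImpute")
pbc.imp <- predict(pbc.imp_process, pbc.imp)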

Fit survival models to PBC data

The pbc.dat data set contains the survival time (labeled survTime), an event indicator (labeled statusComp), and 17 predictors.

We use an 80/20 training/test split (testProp=0.2) and perform 5-fold cross-validation (cv=TRUE, cvFold=5) in the training data set to tune the hyperparameters.

We specify the prediction time of interest as 3 years (times=3). We consider a maximum of 25 intervals (bins.upper=25) for creating the discrete-time data set. The optimal number of intervals will be tuned during cross-validation using Bayesian optimization. We specify that censored observations are treated as censored in an interval if their survival time is observed to be at least halfway through the interval (cens="half").
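
To build intuition for the discrete-time data construction, the following sketch maps the continuous survival times to equal-width intervals (an illustration only; autoSurv() constructs its discrete-time data internally, and its binning scheme may differ):

#Illustration: assign each survival time to one of 5 equal-width intervals
breaks <- seq(0, max(pbc.dat$survTime), length.out=5+1)
interval <- cut(pbc.dat$survTime, breaks=breaks, labels=FALSE, include.lowest=TRUE)
table(interval, pbc.dat$statusComp)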

The specification of the seed (seed=1101) allows for these results to be reproduced.

The following models are available (trainModels):

- Continuous-time models: "cox" (Cox proportional hazards) and "rsf" (random survival forest)
- Discrete-time models: "glm" (logistic regression), "gbm" (gradient boosting machine), "nnet" (neural network), and "cforest" (conditional inference forest), as well as elastic net and support vector machine classifiers

Some of these models have hyperparameters that are also tuned using Bayesian optimization. We use the default values for the parameters controlling the Bayesian optimization search scheme (init=10, iter=20). Tuning is performed by minimizing the Brier score at the maximum of the prediction times specified (times), or by minimizing the integrated Brier score when times is specified as a vector of multiple prediction times.

Note that the following code takes a substantial amount of computation time due to the cross-validation and hyperparameter optimization. Of the available discrete-time models, cforest takes the longest to run. We have included only two continuous-time and two discrete-time models in the vector of models (trainModels) to reduce computation time. The optimization cannot be skipped in the current specification, but computation time can be reduced by limiting the number of Bayesian optimization iterations (iter). The printout indicates the steps of the optimization and can be silenced (verbose.opt=FALSE).

pbc.results <- autoSurv(timeVar = "survTime", 
                        statusVar = "statusComp", 
                        data = pbc.dat, 
                        times =  c(3),
                        trainModels = c("cox","rsf","glm","nnet"),
                        bins.upper = 25,
                        cens = "half",
                        testProp = 0.2, 
                        seed = 1101,
                        cv = TRUE,
                        cvFold = 5,
                        verbose.opt = FALSE
)
#> [1] "Optimizing RSF"
#> 
#>  Best Parameters Found: 
#> Round = 21   nodesize = 1.0000   mtry = 2.0000   Value = -0.1030211 
#> [1] "Optimizing DiscreteTime-GLM"
#> 
#>  Best Parameters Found: 
#> Round = 8    intervals = 15.0000 Value = -0.1104319 
#> [1] "Optimizing DiscreteTime-nnet"
#> 
#>  Best Parameters Found: 
#> Round = 27   intervals = 5.0000  size = 10.0000  decay = 0.4448258   Value = -0.09642358

View tuned hyperparameters for fitted models

We can view the tuned hyperparameters from the Bayesian optimization for each of the models. We can see the optimal number of intervals selected by tuning for each of the discrete-time models.

pbc.results$tuned_params
#> $rsf
#> nodesize     mtry 
#>        1        2 
#> 
#> $glm
#> intervals 
#>        15 
#> 
#> $nnet
#>  intervals       size      decay 
#>  5.0000000 10.0000000  0.4448258

Assess predictive performance

For the tuned models, we examine the predictive performance in the test data set using the following time-dependent metrics, which account for right-censoring:

- area under the ROC curve (AUC)
- Brier score
- scaled Brier score (R2)
- integrated Brier score (IBS) and its scaled version (R2_IBS), computed across multiple prediction times

A data set of all performance metrics can also be obtained ($metrics). Because only a single prediction time was specified in this example, the integrated metrics are degenerate here (IBS is 0 and R2_IBS is NaN in the output below).

pbc.metrics <- do.call("rbind", pbc.results$CVmetrics)

#sort results by decreasing R2 
pbc.metrics[order(pbc.metrics$R2, decreasing=TRUE),]
#>            AUC      Brier        R2 IBS R2_IBS
#> nnet 0.8429798 0.09642358 0.2514203   0    NaN
#> cox  0.8623867 0.10765434 0.2179865   0    NaN
#> glm  0.8401876 0.11043192 0.2116114   0    NaN
#> rsf  0.8554317 0.10302112 0.2005637   0    NaN

Plot predictions for an individual

We can obtain the predicted survival probabilities at a set of prediction times for a particular individual using the predict_autoSurv() function, and extract them for any of the models that were fit.

times_pred <- seq(0, 3, by=0.01)
pbc.testResults.id <- predict_autoSurv(pbc.results, 
                                    newdata=pbc.dat[4,], 
                                    times=times_pred, 
                                    timeVar="survTime", 
                                    statusVar="statusComp")

pred_probs <- pbc.testResults.id$pred_probabilities
mods <- names(pred_probs)
pred_probs_mat <- data.frame("model"=rep(mods, each=length(times_pred)),
                             "times"=rep(times_pred, length(mods)),
                             "probs"=unlist(pred_probs))
pbc.testid <- ggplot(data = pred_probs_mat, aes(y = probs, x = times, color = model)) +
  geom_step(size=0.8) +
  theme_light() + theme(text=element_text(size=15)) +
  xlab("Prediction times") + xlim(0,3) + 
  ylab("Predicted survival probability") +
  guides(color=guide_legend(title=""))
pbc.testid
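
From this output we can also read off each model's predicted 3-year survival probability for this individual (a minimal sketch, assuming each element of pred_probabilities is a vector of probabilities aligned with times_pred, as used above):

#Predicted survival probability at 3 years for this individual, by model
sapply(pred_probs, function(p) p[which.min(abs(times_pred - 3))])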

Example 2 - Colon cancer

In this example, we demonstrate the performance of discrete-time survival models using the colon data set from the survival package. We begin by formatting the data set for use in the analysis.

We use a 60/20/20 training/validation/test split, where 60% of the data is used to train the models, 20% is used as an internal validation data set to tune the hyperparameters, and 20% is held out as an external test data set. The 60/20 training/validation split can be specified within the autoSurv() function, but we must first set aside the 20% sample for the test data set.

set.seed(1101)
data(colon, package="survival")
colon$survTime <- colon$time/365.25
colon <- subset(colon, etype==2)
colon.dat <- subset(colon, select=-c(id,study,etype,time))
colon.dat <- colon.dat[complete.cases(colon.dat),]

#Standardize covariates
colon.dat_covs <- subset(colon.dat, select = -c(status,survTime))
colon.dat_covs_process <- caret::preProcess(colon.dat_covs, method=c("center", "scale"))
colon.dat_covs <- predict(colon.dat_covs_process, colon.dat_covs)
colon.dat <- cbind(colon.dat[,c("survTime","status")], colon.dat_covs)

#Select a 20% data set as an external test set 
colon.dat_val.ids <- sample(1:nrow(colon.dat), size=nrow(colon.dat)*0.2, replace=FALSE) 
colon.dat_train <- colon.dat[-colon.dat_val.ids, ]
colon.dat_val <- colon.dat[colon.dat_val.ids, ]

The resulting data set has 888 individuals, with a median follow-up time of 6.4 years and a censoring rate of 52%.
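
We can confirm the sizes of the two subsets (roughly 80% for training/validation and 20% held out):

#Check the split sizes
nrow(colon.dat_train)
nrow(colon.dat_val)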

Fit survival models to Colon cancer data

The colon.dat data set contains the survival time (labeled survTime), an event indicator (labeled status), and 11 predictors.

We use a 60/20/20 training/validation/test split. Since we have already set aside 20% for the test data set, we specify the validation proportion as testProp=0.2/(1-0.2) (i.e., 0.25, since 25% of the remaining 80% corresponds to 20% of the original data). We don't use cross-validation in this example (cv=FALSE); thus, the hyperparameters are tuned by optimizing performance in the internal validation data set.

We specify the prediction times of interest as the sequence of times 1, 2, 3, 4, 5 years (times=c(1,2,3,4,5)). We have included only two continuous-time and two discrete-time models in the vector of models (trainModels) to reduce computation time. We consider a maximum of 25 intervals (bins.upper=25) for creating the discrete-time data set. The optimal number of intervals will be tuned using Bayesian optimization to minimize the integrated Brier score across the specified times in the internal validation data set. As in the first example, censored observations are treated as censored in an interval if their survival time is observed to be at least halfway through the interval (cens="half").

The specification of the seed (seed=1101) allows for these results to be reproduced. We silence the printout for the Bayesian optimization (verbose.opt=FALSE). We use default specifications for the other parameters. The following code takes less computation time than the example above, where cross-validation was used.

colon.results <- autoSurv(timeVar = "survTime", 
                          statusVar = "status", 
                          data = colon.dat_train, 
                          times =  c(1, 2, 3, 4, 5),
                          trainModels = c("cox","rsf","glm","gbm"), 
                          bins.upper = 25,
                          cens = "half",
                          testProp = 0.2/(1-0.2), #since we have removed 20% for the test data set
                          seed = 1101,
                          cv = FALSE,
                          verbose.opt = FALSE
)
#> [1] "Optimizing RSF"
#> 
#>  Best Parameters Found: 
#> Round = 10   nodesize = 10.0000  mtry = 2.0000   Value = -0.1273826 
#> [1] "Optimizing DiscreteTime-GLM"
#> 
#>  Best Parameters Found: 
#> Round = 10   intervals = 7.0000  Value = -0.1230428 
#> [1] "Optimizing DiscreteTime-GBM"
#> 
#>  Best Parameters Found: 
#> Round = 19   intervals = 19.0000 n.trees = 154.0000  interaction.depth = 3.0000  shrinkage = 0.04185465  n.minobsinnode = 19.0000    Value = -0.1231758

View tuned hyperparameters for fitted models

We can view the tuned hyperparameters from the Bayesian optimization for each of the models. We can see the optimal number of intervals selected by tuning for each of the discrete-time models.

colon.results$tuned_params
#> $rsf
#> nodesize     mtry 
#>       10        2 
#> 
#> $glm
#> intervals 
#>         7 
#> 
#> $gbm
#>         intervals           n.trees interaction.depth         shrinkage 
#>       19.00000000      154.00000000        3.00000000        0.04185465 
#>    n.minobsinnode 
#>       19.00000000

Assess predictive performance in a new data set

We apply the trained prediction models to a new data set using the predict_autoSurv() function. We specify that the models should be evaluated on the held-out test data set (colon.dat_val) at the time points 1, 2, 3, 4, 5 years (times=1:5). We specify the columns for survival time (timeVar="survTime") and event status (statusVar="status") so that predictive performance can be evaluated in the new data set.

colon.testResults <- predict_autoSurv(colon.results, 
                                   newdata=colon.dat_val, 
                                   times=1:5, 
                                   timeVar="survTime", 
                                   statusVar="status")
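
The returned object includes data frames of time-dependent metrics by model and prediction time; for example, the $auc element used in the plotting code below can be inspected directly (a minimal sketch):

#Time-dependent AUC for each model at each prediction time
head(colon.testResults$auc)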

Plot of prediction metrics

We can plot the prediction metrics over the specified times.

ggplot(data = colon.testResults$auc, aes(y = AUC,x = times, color = model)) +
  geom_point(size=2) +
  geom_line(size=0.8) +
  theme_light() + theme(text=element_text(size=15)) + 
  xlab("Prediction times") + ylab("AUC") +
  guides(color=guide_legend(title=""))

ggplot(data = colon.testResults$brier, aes(y = Brier, x = times, color = model)) +
  geom_point(size=2) +
  geom_line(size=0.8) +
  theme_light() + theme(text=element_text(size=15)) + 
  xlab("Prediction times") + ylab("Brier score") +
  guides(color=guide_legend(title=""))