library(devtools)
install_github("ksuresh17/autoSurv")
library(autoSurv)
library(ggplot2)
Here, we describe the autoSurv() function, developed to train and evaluate the performance of survival prediction models applied to time-to-event data. Currently, the function implements the continuous-time models Cox proportional hazards and random survival forests, and various classification models (logistic regression, elastic net, support vector machines, gradient boosting machines, neural networks) in a discrete-time framework. The function splits the data into training and test sets, builds the discrete-time data set, tunes hyperparameters (including the number of discrete-time intervals) using Bayesian optimization, and evaluates predictive performance with time-dependent metrics. The resulting outputs include the fitted models, the tuned hyperparameters ($tuned_params), and performance metrics ($CVmetrics, $metrics).
The autoSurv() function requires the following parameters:
- timeVar: string giving the name of the column in the data set that denotes the time-to-event outcome.
- statusVar: string giving the name of the column in the data set that denotes the event indicator. The variable should be coded as 1 for an event and 0 for censoring.
- data: data frame containing timeVar, statusVar, idVar, and other variables to be used in time-to-event prediction.
- times: optional list of time horizon(s) of interest. If not specified, the median of the event times is used. If specified as a single value, hyperparameter tuning optimizes predictive accuracy using the Brier score at the specified time. If specified as a vector, hyperparameter tuning optimizes predictive accuracy using the integrated Brier score over the specified times. The trained prediction models are evaluated on the test data at these times.
- trainModels: vector of models to train. Current options are the continuous-time models "cox" and "rsf", and the discrete-time models "glm", "gam", "gbm", "glmnet", "svm", "cforest", and "nnet".
- bins.lower: minimum number of discrete-time intervals used to build the discrete-time data set. Default is 5.
- bins.upper: maximum number of discrete-time intervals used to build the discrete-time data set. Default is 10. The number of intervals is tuned as a hyperparameter over the range [bins.lower, bins.upper]; a small sketch of the discrete-time expansion follows this list.
- cens: string indicating which method to use for handling censored observations. Options are "same" (treat observations as censored in the same interval in which they were last observed) and "half" (treat observations as censored in an interval if their survival time is observed to be at least halfway through the interval).
- testProp: proportion of samples to be separated into a test set. This is the data set on which the performance metrics are assessed. The remaining observations are used as the training set on which the model is built and hyperparameters are tuned.
- seed: integer value to set a random seed for reproducibility.
- init: number of initialization points for hyperparameter optimization. Applies to all models trained. Choose a lower value to reduce computation time.
- iter: number of iterations of hyperparameter optimization after initialization. Applies to all models trained. Choose a lower value to reduce computation time.
- cv: enable or disable cross-validation in the training data for the tuning of hyperparameters. If cv=TRUE, the hyperparameters are tuned to optimize the cross-validated performance metrics. If cv=FALSE, the hyperparameters are tuned to optimize the performance metrics computed in the test data set.
- cvFold: integer specifying the number of folds to use in k-fold cross-validation.
- verbose.opt: enable or disable printing of the Bayesian optimization progress.
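To build intuition for the discrete-time framework: each subject's follow-up is expanded into one row per interval, with an interval-specific event indicator, and the classification models are fit to this person-period data set. autoSurv constructs this data set internally; the code below is only a minimal illustration of the idea, using survival::survSplit on hypothetical toy data.

library(survival)

#Toy data: subject 1 has an event at 2.5 years, subject 2 is censored at 1.2 years
toy <- data.frame(id = 1:2, time = c(2.5, 1.2), status = c(1, 0), age = c(50, 61))

#Expand into person-period format with yearly intervals: one row per
#subject-interval, where status becomes the interval-specific event indicator
toy_long <- survSplit(Surv(time, status) ~ ., data = toy,
                      cut = c(1, 2), episode = "interval")
toy_long

Subject 1 contributes three rows (interval outcomes 0, 0, 1) and subject 2 contributes two rows (0, 0), so a binary classifier can be trained on the interval-level outcome.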
In this example, we demonstrate the performance of discrete-time survival models using the pbc data set from the survival package. We begin by formatting the data set for use in the analysis. The current implementation requires a complete-case data set, so missing observations are excluded. Alternatively, the user can use other functions to impute the missing observations to create a complete data set. Each row should correspond to a unique instance.
data(pbc, package="survival")
pbc$statusComp <- ifelse(pbc$status == 2, 1, 0)
pbc$survTime <- pbc$time/365.25
pbc.dat <- pbc[,c(5,7:22)]
pbc.dat <- pbc.dat[complete.cases(pbc.dat),]

#Standardize covariates
pbc.dat_covs <- subset(pbc.dat, select = -c(statusComp,survTime))
pbc.dat_covs_process <- caret::preProcess(pbc.dat_covs, method=c("center", "scale"))
pbc.dat_covs <- predict(pbc.dat_covs_process, pbc.dat_covs)
pbc.dat <- cbind(pbc.dat[,c("survTime","statusComp")], pbc.dat_covs)
The resulting data set has 276 individuals, with a median follow-up time of 6.7 years and a censoring rate of 60%. The pbc.dat data set contains the survival time (labeled survTime), an event indicator (labeled statusComp), and 17 predictors.
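These summary figures can be checked directly from pbc.dat. The sketch below is not part of the autoSurv workflow; it uses the reverse Kaplan-Meier method, which estimates follow-up time by treating censoring as the event of interest.

library(survival)

nrow(pbc.dat) #number of individuals

#Censoring rate: proportion of individuals without an observed event
mean(pbc.dat$statusComp == 0)

#Median follow-up via reverse Kaplan-Meier (censoring treated as the event)
survfit(Surv(survTime, statusComp == 0) ~ 1, data = pbc.dat)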
We use an 80/20 training/test split (testProp=0.2) and perform 5-fold cross-validation (cv=TRUE, cvFold=5) in the training data set to tune hyperparameters. We specify the prediction time of interest as 3 years (times=3). We consider a maximum of 25 intervals (bins.upper=25) for creating the discrete-time data set. The optimal number of intervals will be tuned during cross-validation using Bayesian optimization. We specify that censored observations should be treated as being censored in an interval if their survival time is observed to be at least halfway through the interval (cens="half"). The specification of the seed (seed=1101) allows for these results to be reproduced.
The following models are available (trainModels):

- Continuous-time models: Cox proportional hazards (cox) and random survival forest (rsf)
- Discrete-time models: logistic regression (glm), elastic net (glmnet), gradient boosting machines (gbm), support vector machines (svm), conditional inference forest (cforest), and neural networks (nnet)

Some of these models have hyperparameters that are also tuned using Bayesian optimization. We use the default values for the parameters that control the Bayesian optimization search scheme (init=10, iter=20). Tuning is performed by minimizing the Brier score at the maximum of the specified prediction times (times), or by minimizing the integrated Brier score when the prediction times are specified as a vector of multiple times.
Note that the following code will take a substantial amount of computation time due to the cross-validation and optimization of the hyperparameters. Of the available discrete-time models, cforest takes the longest to run. We have included only two continuous-time and two discrete-time models in the vector of models (trainModels) to reduce computation time. The optimization cannot be skipped in the current specification, but computation time can be reduced by limiting the number of times the Bayesian optimization is repeated (iter). The printout indicates the steps in the optimization, which can be silenced (verbose.opt=FALSE).
pbc.results <- autoSurv(timeVar = "survTime",
                        statusVar = "statusComp",
                        data = pbc.dat,
                        times = c(3),
                        trainModels = c("cox","rsf","glm","nnet"),
                        bins.upper = 25,
                        cens = "half",
                        testProp = 0.2,
                        seed = 1101,
                        cv = TRUE,
                        cvFold = 5,
                        verbose.opt = FALSE)
#> [1] "Optimizing RSF"
#>
#> Best Parameters Found:
#> Round = 21 nodesize = 1.0000 mtry = 2.0000 Value = -0.1030211
#> [1] "Optimizing DiscreteTime-GLM"
#>
#> Best Parameters Found:
#> Round = 8 intervals = 15.0000 Value = -0.1104319
#> [1] "Optimizing DiscreteTime-nnet"
#>
#> Best Parameters Found:
#> Round = 27 intervals = 5.0000 size = 10.0000 decay = 0.4448258 Value = -0.09642358
We can view the tuned hyperparameters from the Bayesian optimization for each of the models, including the optimal number of intervals selected by tuning for each of the discrete-time models. In the printout above, Value is the negative of the Brier score being minimized (the optimizer maximizes it).
pbc.results$tuned_params
#> $rsf
#> nodesize mtry
#> 1 2
#>
#> $glm
#> intervals
#> 15
#>
#> $nnet
#> intervals size decay
#> 5.0000000 10.0000000 0.4448258
For the tuned models, we examine the predictive performance in the test data set ($CVmetrics) using the following time-dependent metrics that account for right-censoring:

- AUC ($auc): Ranges from 0 to 1, where values closer to 1 indicate better discrimination, and 0.5 indicates discriminative ability similar to chance.
- Brier score (Brier column in $brier): Similar to mean squared error, where values closer to 0 indicate better overall predictive performance. The "Null model" indicates the performance of a model with no covariates and can be used as a benchmark for assessing the performance of the other models.
- Scaled Brier score (R2 column in $brier): Computed as 1 - (Model Brier score)/(Null model Brier score), where values closer to 1 indicate better overall predictive performance.
- Integrated Brier score (IBS column in $brier): Integrates the Brier score over the prediction times (times) specified in autoSurv. Will be 0 if there is only one prediction time specified.

A data set of all performance metrics can also be obtained ($metrics).
pbc.metrics <- do.call("rbind", pbc.results$CVmetrics)

#Sort results by decreasing R2
pbc.metrics[order(pbc.metrics$R2, decreasing=TRUE),]
#>            AUC      Brier        R2 IBS R2_IBS
#> nnet 0.8429798 0.09642358 0.2514203   0    NaN
#> cox  0.8623867 0.10765434 0.2179865   0    NaN
#> glm  0.8401876 0.11043192 0.2116114   0    NaN
#> rsf  0.8554317 0.10302112 0.2005637   0    NaN
We can obtain the predicted probabilities at a set of prediction times for a particular individual using the predict_autoSurv() function, and extract the predicted survival probabilities for any of the models that were fit.
times_pred <- seq(0, 3, by=0.01)
pbc.testResults.id <- predict_autoSurv(pbc.results,
                                       newdata=pbc.dat[4,],
                                       times=times_pred,
                                       timeVar="survTime",
                                       statusVar="statusComp")

pred_probs <- pbc.testResults.id$pred_probabilities
mods <- names(pred_probs)
pred_probs_mat <- data.frame("model"=rep(mods, each=length(times_pred)),
                             "times"=rep(times_pred, length(mods)),
                             "probs"=unlist(pred_probs))
pbc.testid <- ggplot(data = pred_probs_mat, aes(y = probs, x = times, color = model)) +
  geom_step(size=0.8) +
  theme_light() + theme(text=element_text(size=15)) +
  xlab("Prediction times") + xlim(0,3) +
  ylab("Predicted survival probability") +
  guides(color=guide_legend(title=""))
pbc.testid
In this example, we demonstrate the performance of discrete-time survival models using the colon data set from the survival package. We begin by formatting the data set for use in the analysis.
We use a 60/20/20 training/validation/test split, where 60% of the data is used to train the models, 20% is used as an internal validation data set to tune the hyperparameters, and 20% is used as a test data set. We can specify the 60/20 split within the autoSurv() function, but we must first set aside the 20% sample for the test data set.
set.seed(1101)
data(colon, package="survival")
#> Warning in data(colon, package = "survival"): data set 'colon' not found
colon$survTime <- colon$time/365.25
colon <- subset(colon, etype==2)
colon.dat <- subset(colon, select=-c(id,study,etype,time))
colon.dat <- colon.dat[complete.cases(colon.dat),]

#Standardize covariates
colon.dat_covs <- subset(colon.dat, select = -c(status,survTime))
colon.dat_covs_process <- caret::preProcess(colon.dat_covs, method=c("center", "scale"))
colon.dat_covs <- predict(colon.dat_covs_process, colon.dat_covs)
colon.dat <- cbind(colon.dat[,c("survTime","status")], colon.dat_covs)

#Select a 20% data set as an external test set
colon.dat_val.ids <- sample(1:nrow(colon.dat), size=nrow(colon.dat)*0.2, replace=FALSE)
colon.dat_train <- colon.dat[-colon.dat_val.ids, ]
colon.dat_val <- colon.dat[colon.dat_val.ids, ]
The resulting data set has 888 individuals, with a median follow-up time of 6.4 years and a censoring rate of 52%. The colon.dat data set contains the survival time (labeled survTime), an event indicator (labeled status), and 17 predictors.
We use a 60/20/20 training/validation/test split. Since we have already set aside 20% of the original data as the test set, we select 20% of the original data for the validation set by specifying the proportion as testProp=0.2/(1-0.2). We don't use cross-validation in this example (cv=FALSE), so the hyperparameters are tuned by optimizing the performance metrics computed in this internal validation split.
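The arithmetic behind this proportion, as a quick check:

#25% of the 80% remaining after the external split equals 20% of the original data
0.2 / (1 - 0.2)         #internal split proportion: 0.25
(0.2 / (1 - 0.2)) * 0.8 #fraction of the original sample: 0.2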
We specify the prediction times of interest as the sequence of times 1, 2, 3, 4, 5 years (times=c(1,2,3,4,5)). We have included only two continuous-time and two discrete-time models in the vector of models (trainModels) to reduce computation time. We consider a maximum of 25 intervals (bins.upper=25) for creating the discrete-time data set. The optimal number of intervals will be tuned using Bayesian optimization to minimize the integrated Brier score across the specified times in the test data set. We specify that censored observations should be treated as being censored in an interval if their survival time is observed to be at least halfway through the interval (cens="half").
The specification of the seed (seed=1101) allows for these results to be reproduced. We silence the printout for the Bayesian optimization (verbose.opt=FALSE). We use default specifications for the other parameters. The following code takes less computation time than the previous example, where cross-validation was used.
colon.results <- autoSurv(timeVar = "survTime",
                          statusVar = "status",
                          data = colon.dat_train,
                          times = c(1, 2, 3, 4, 5),
                          trainModels = c("cox","rsf","glm","gbm"),
                          bins.upper = 25,
                          testProp = 0.2/(1-0.2), #since we have removed 20% for the test data set
                          seed = 1101,
                          cv = FALSE,
                          verbose.opt = FALSE)
#> [1] "Optimizing RSF"
#>
#> Best Parameters Found:
#> Round = 10 nodesize = 10.0000 mtry = 2.0000 Value = -0.1273826
#> [1] "Optimizing DiscreteTime-GLM"
#>
#> Best Parameters Found:
#> Round = 10 intervals = 7.0000 Value = -0.1230428
#> [1] "Optimizing DiscreteTime-GBM"
#>
#> Best Parameters Found:
#> Round = 19 intervals = 19.0000 n.trees = 154.0000 interaction.depth = 3.0000 shrinkage = 0.04185465 n.minobsinnode = 19.0000 Value = -0.1231758
We can view the tuned hyperparameters from the Bayesian optimization for each of the models. We can see the optimal number of intervals selected by tuning for each of the discrete-time models.
colon.results$tuned_params
#> $rsf
#> nodesize mtry
#> 10 2
#>
#> $glm
#> intervals
#> 7
#>
#> $gbm
#> intervals n.trees interaction.depth shrinkage
#> 19.00000000 154.00000000 3.00000000 0.04185465
#> n.minobsinnode
#> 19.00000000
We fit the trained prediction models to a new data set using the predict_autoSurv() function. We specify that the models should be evaluated on the validation data set (colon.dat_val) at the time points 1, 2, 3, 4, 5 years (times=1:5). We specify the columns in the data set for survival time (timeVar="survTime") and event status (statusVar="status") so that predictive performance can be evaluated in the new data set.
colon.testResults <- predict_autoSurv(colon.results,
                                      newdata=colon.dat_val,
                                      times=1:5,
                                      timeVar="survTime",
                                      statusVar="status")
We can plot the prediction metrics over the specified times.
ggplot(data = colon.testResults$auc, aes(y = AUC, x = times, color = model)) +
geom_point(size=2) +
geom_line(size=0.8) +
theme_light() + theme(text=element_text(size=15)) +
xlab("Prediction times") + ylab("AUC") +
guides(color=guide_legend(title=""))
ggplot(data = colon.testResults$brier, aes(y = Brier, x = times, color = model)) +
geom_point(size=2) +
geom_line(size=0.8) +
theme_light() + theme(text=element_text(size=15)) +
xlab("Prediction times") + ylab("Brier score") +
guides(color=guide_legend(title=""))