International Journal of Statistics and Applications
p-ISSN: 2168-5193 e-ISSN: 2168-5215
2021; 11(2): 37-49
doi:10.5923/j.statistics.20211102.03
Received: Jul. 20, 2021; Accepted: Aug. 6, 2021; Published: Aug. 15, 2021

Hellen Wanjiru Waititu1, Joseph K. Arap Koske1, Nelson Owuor Onyango2
1School of Physical and Biological Sciences, Moi University, Eldoret, Kenya
2School of Mathematics, University of Nairobi, Nairobi, Kenya
Correspondence to: Hellen Wanjiru Waititu, School of Physical and Biological Sciences, Moi University, Eldoret, Kenya.
Copyright © 2021 The Author(s). Published by Scientific & Academic Publishing.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

Under Five Child Mortality (U5CM) remains a major health problem in the developing world. The Sustainable Development Goals target of 25 deaths per 1000 live births has not yet been achieved in many Low and Middle Income Countries (LMIC). This study used the 2014 Kenya Demographic and Health Survey (KDHS) data to understand the determinants of U5CM. The KDHS (2014) data are characterized by high dimensionality, high imbalance and violation of the Proportional Hazard (PH) assumption, among other statistical challenges. This study aimed at handling the problem of non-proportional hazards among the covariates of survival regression models. To achieve this, we used three splitting rules, namely log-rank, log-rank score and Bs.gradient. The data were balanced using the Random Under-sampling method. The balanced data were then used in Random Survival Forests (RSF) for variable selection under each of the three splitting rules. The variables selected under each rule were fitted in the Cox Aalen’s model for prediction, and model selection was carried out using the concordance index. The model with the log-rank splitting rule recorded the highest concordance of 0.916, followed by Bs.gradient with a concordance of 0.864, while log-rank score resulted in a concordance of 0.799. Judged by concordance alone, these results show the superiority of the log-rank splitting rule. However, log-rank is optimal only when the hazard is proportional over time, and some of the variables in the data were found to violate the PH assumption, making the log-rank splitting rule not optimal. We therefore settle on the Bs.gradient splitting rule, which still has a high concordance index of 0.86 and a smaller error rate of 0.028. Using Balanced Random Survival Forests (BRSF) with the Bs.gradient splitting rule, the identified determinants of U5CM are V207 (sum of deceased daughters), V219 (sum total of living children) and B8 (age of the child). Hence, the age of the child and the siblings’ information are identified as some of the key determinants of U5CM.
Keywords: Splitting rules, Balanced Random Survival Forests, Under Five Child Mortality, Cox Aalen’s model
Cite this paper: Hellen Wanjiru Waititu, Joseph K. Arap Koske, Nelson Owuor Onyango, Analysis of Balanced Random Survival Forest Using Different Splitting Rules: Application on Child Mortality, International Journal of Statistics and Applications, Vol. 11 No. 2, 2021, pp. 37-49. doi: 10.5923/j.statistics.20211102.03.
Figure 1. General survival curves for the data used
Figure 2. Survival curves by covariates
a) Draw $B$ bootstrap samples from the original data having $n$ samples. On average, each bootstrap sample sets aside 37% of the data, named Out of Bag (OOB) data with respect to that bootstrap sample, and each sample has $p$ predictors.
b) For each bootstrap sample, a survival tree is developed. This is done by randomly choosing $m$ out of the $p$ variables for splitting on. The value of $m$ depends on the number of available predictors and is data specific. All the observations in the bootstrap sample are assigned to the topmost node of the tree, which is also referred to as the root node. This root node is then separated into two daughter nodes, each of which is recursively split so as to maximize the survival difference between daughter nodes.
c) The trees are grown to full size, subject to the restriction that each endmost node should have at least $d_0 > 0$ unique events.
d) After the tree is fully grown, the in-bag and out of bag (OOB) ensemble estimators are computed by taking the mean value over all the trees.
e) The ensemble OOB error is calculated using the first $b$ trees, where $b = 1, \ldots, B$.
f) OOB estimation is used to calculate the Variable Importance (VIMP) [17].
By averaging over all trees, a reliable measure of the importance of a variable regarding time to event can be obtained [18]. RSF gives a measure of VIMP which is totally nonparametric. VIMP has been found to be effective in many applied settings for selecting variables [19], [20], [21], [22], [23]. In this study, the highly predictive risk factors were extracted with the RSF model under three different splitting rules.
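The 37% out-of-bag share quoted in step a) follows from sampling with replacement: a given case is missed by all $n$ draws with probability $(1 - 1/n)^n$, which approaches $e^{-1} \approx 0.368$. The sketch below checks this by simulation; the function name `oob_fraction` and the sample sizes are hypothetical choices made for illustration and are not taken from the study.

```python
import random

def oob_fraction(n, n_trees, seed=0):
    """Average share of cases left out of a bootstrap sample of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # Draw n case indices with replacement; the set keeps the distinct in-bag cases.
        in_bag = {rng.randrange(n) for _ in range(n)}
        total += (n - len(in_bag)) / n
    return total / n_trees

# Probability that a given case is never drawn in n draws with replacement.
expected = (1 - 1 / 1000) ** 1000
# Simulated average out-of-bag share over 200 bootstrap samples (illustrative sizes).
estimate = oob_fraction(n=1000, n_trees=200)
print(round(expected, 3), round(estimate, 3))
```

The simulated share settles close to the theoretical value, which is why roughly 63% of the cases end up in-bag for each tree.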
In the survival data, $T$ is defined as the minimum of the event time and the censoring time. Hence $T = \min(T^{0}, C)$, where $T^{0}$ is the event time and $C$ the censoring time. $\delta$ is the censoring indicator, defined as $\delta = I(T^{0} \le C)$.
While growing a tree in RSF, node splitting must take censoring into consideration. With reference to the RSF algorithm, a forest develops from $B$ randomly drawn bootstrap samples, each of which becomes the root of a tree in the forest. There being $p$ predictors in each bootstrap sample, $m$ predictors are randomly chosen for splitting on. Suppose we take $h$ to be the topmost node of the tree, which is to be split into two daughter nodes. Within the node, there exist $m$ predictors and $n$ observations. The splitting process is as follows [24].
• Take any predictor $x$ from the $m$ predictors.
• Find the splitting value $c$ such that the survival difference between $x \le c$ and $x > c$ for predictor $x$ is maximum. In this case $x \le c$ splits to the left daughter node while $x > c$ splits to the right daughter node.
• Calculate the survival difference between the two daughter nodes using a pre-determined splitting method.
• Take another split value $c$ in predictor $x$, until we get the split value which results in maximum survival difference for predictor $x$.
• From the remaining $m - 1$ predictors in the node, the process is repeated until we get the predictor $x^{*}$ and split value $c^{*}$ which give maximum survival difference between the two daughter nodes.
• Applying the node splitting process in each of the new daughter nodes and recursively partitioning the nodes leads to the growth of the tree.
• The process is applied to all the $B$ root nodes, leading to the growth of the forest.
When the survival difference is maximum, cases that are unlike with respect to survival are pushed apart by the tree. Increasing the number of nodes separates dissimilar cases further. This results in homogeneous nodes in the tree, consisting of cases with similar survival. In this research, the following splitting methods were used to calculate the survival difference between any two daughter nodes.
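Consider splitting a node $h$ of a tree using the log-rank splitting rule. The data at the node are presented as $(x_i, T_i, \delta_i)$, $i = 1, \ldots, n$, where $x_i$ is the $i$-th predictor, and $T_i$ and $\delta_i$ represent the $i$-th survival duration and censoring status respectively. The information at time $t_k$ can be summarized by the following quantities.
Here $d_{k,j}$ stands for the number of events in daughter node $j$ at time $t_k$, and $Y_{k,j}$ represents the individuals who are alive in daughter node $j$, $j = 1, 2$, at time $t_k$. $Y_{k,1}$ is the number of $T_i \ge t_k$ with $x_i \le c$, where $T_i$ is the duration of survival for the $i$-th individual and $t_k$ the distinct event time in node $h$. $Y_{k,2}$ is the number of $T_i \ge t_k$ with $x_i > c$. Write $d_k = d_{k,1} + d_{k,2}$ and $Y_k = Y_{k,1} + Y_{k,2}$, and let $N$ be the number of distinct event times in node $h$.
For a split using covariate $x$ and its splitting value $c$, the survival difference between the two daughter nodes is calculated using the log-rank test, given as
$$L(x, c) = \frac{\sum_{k=1}^{N} \left( d_{k,1} - Y_{k,1} \frac{d_k}{Y_k} \right)}{\sqrt{\sum_{k=1}^{N} \frac{Y_{k,1}}{Y_k} \left( 1 - \frac{Y_{k,1}}{Y_k} \right) \left( \frac{Y_k - d_k}{Y_k - 1} \right) d_k}}.$$
This equation measures the magnitude of separation between the two daughter nodes. The best split is given by the greatest difference between the two daughter nodes, which is given by the largest value of $|L(x, c)|$ [25].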
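As an illustration of the log-rank splitting rule described above, the following sketch computes the standard two-sample log-rank statistic for a candidate split and scans all candidate split values of one predictor. It is a minimal sketch under stated assumptions: the function names `logrank_split_stat` and `best_split` and the toy data are hypothetical, and the code uses plain Python rather than the software used in the study.

```python
import math

def logrank_split_stat(times, events, x, c):
    """Log-rank statistic for the split x <= c (left node) versus x > c (right node)."""
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    num, var = 0.0, 0.0
    for tk in event_times:
        # Cases still at risk (alive) just before time tk.
        at_risk = [(xi, ti, di) for xi, ti, di in zip(x, times, events) if ti >= tk]
        y = len(at_risk)
        y1 = sum(1 for xi, _, _ in at_risk if xi <= c)
        # Events at tk in the whole node and in the left daughter node.
        d = sum(1 for _, ti, di in at_risk if ti == tk and di == 1)
        d1 = sum(1 for xi, ti, di in at_risk if ti == tk and di == 1 and xi <= c)
        num += d1 - y1 * d / y          # observed minus expected events, left node
        if y > 1:                        # variance term is undefined when one case remains
            var += (y1 / y) * (1 - y1 / y) * ((y - d) / (y - 1)) * d
    return num / math.sqrt(var) if var > 0 else 0.0

def best_split(times, events, x):
    """Split value maximizing |log-rank statistic|, or None if x has one unique value."""
    candidates = sorted(set(x))[:-1]     # splitting at the maximum leaves the right node empty
    return max(candidates,
               key=lambda c: abs(logrank_split_stat(times, events, x, c)),
               default=None)

# Toy node: six uncensored cases whose predictor equals their survival time.
times = [1, 2, 3, 7, 8, 9]
events = [1, 1, 1, 1, 1, 1]
x = [1, 2, 3, 7, 8, 9]
split = best_split(times, events, x)
```

In a forest, the same scan would be repeated over each randomly chosen predictor, and the predictor and split value with the largest absolute statistic would define the two daughter nodes.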
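For the log-rank score splitting rule, the ranks are computed given an ordered predictor $x$ such that $x_1 \le x_2 \le \cdots \le x_n$. The rank for each time $T_l$ is calculated as
$$a_l = \delta_l - \sum_{k:\, T_k \le T_l} \frac{\delta_k}{n - \Gamma_k + 1},$$
where $\Gamma_k = \#\{t : T_t \le T_k\}$ is the number of observations with $T_t \le T_k$. Let $\bar{a}$ and $s_a^2$ be the sample mean and sample variance of $a_l$ for $l = 1, \ldots, n$. The formula for the log-rank score test is given by
$$S(x, c) = \frac{\sum_{x_l \le c} a_l - n_1 \bar{a}}{\sqrt{n_1 \left( 1 - \frac{n_1}{n} \right) s_a^2}},$$
where $n_1$ is the number of observations with $x_l \le c$. This split rule gives the magnitude of node separation by $|S(x, c)|$, where the best split is given by the maximum value over $x$ and $c$.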
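The log-rank score rule can be sketched in the same way. The code below assumes the usual form of the log-rank scores and of the standardized statistic (the displayed equations are missing from this copy of the text), and all function names and toy data are hypothetical.

```python
import math

def logrank_scores(times, events):
    """Log-rank score (rank) of each case, computed from times and event indicators."""
    n = len(times)
    # gamma[k] counts the cases whose time does not exceed the time of case k.
    gamma = [sum(1 for t in times if t <= tk) for tk in times]
    scores = []
    for tl, dl in zip(times, events):
        cum = sum(dk / (n - gk + 1)
                  for tk, dk, gk in zip(times, events, gamma) if tk <= tl)
        scores.append(dl - cum)
    return scores

def logrank_score_stat(times, events, x, c):
    """Standardized sum of scores in the left node defined by x <= c."""
    a = logrank_scores(times, events)
    n = len(a)
    mean_a = sum(a) / n
    var_a = sum((ai - mean_a) ** 2 for ai in a) / (n - 1)   # sample variance
    left = [ai for ai, xi in zip(a, x) if xi <= c]
    n1 = len(left)
    denom = math.sqrt(n1 * (1 - n1 / n) * var_a)
    return (sum(left) - n1 * mean_a) / denom if denom > 0 else 0.0

# Toy node: six uncensored cases; short survivors have small predictor values.
times = [1, 2, 3, 7, 8, 9]
events = [1, 1, 1, 1, 1, 1]
x = [1, 2, 3, 7, 8, 9]
stat_short_left = logrank_score_stat(times, events, x, 3)
# Reversing the predictor sends the long survivors to the left node instead.
stat_long_left = logrank_score_stat(times, events, list(reversed(x)), 3)
```

Because the scores depend only on the times and event indicators, they are computed once per node; only the left-node sum changes from one candidate split to the next.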
Let $p_i$ be the $i$-th predicted probability in a series of $N$ such predictions. The paired observation is given as $o_i = 1$ if the event of interest occurs on the $i$-th occasion, and $o_i = 0$ otherwise. The Brier score (BS) is the mean-squared error over the $N$ pairs of predictions and observations,
$$BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2.$$
In this case, the time horizon used for the Brier score is set to a specified percentile of the observed event times, given as a value between 0 and 1.
Suppose we have pairs of predictor and response, say $(X_i, Y_i)$ for $i = 1, \ldots, n$. The usual regression procedure attaches the conditional average of the response variable $Y$ to a specified set of predictors $X = x$. [27] introduced Quantile Regression Forests (QRF), which connect an empirical cumulative distribution function with the outputs of a tree. Consider a node, formed from randomly selected variables, that is to be split into two daughter nodes. Suppose the homogeneity of each group is measured by the spread of its responses around the sample mean of the group. For an optimal splitting selection, the homogeneities of the two daughter nodes are compared with that of the node being split. The splitting value is the one that maximizes the resulting gain in homogeneity, where the search runs over a randomly selected sample of predictors from the predictor space. The resulting nodes are recursively split until the stopping criterion is reached. The terminal node gives the predicted value.
[28] suggested that, instead of maximizing the variance heterogeneity of the daughter nodes, one maximizes a criterion built from indicator functions, each of which takes a value of 1 when a response is more than a given quantile of the observations of the node being split. The selection of the split is connected with a gradient-based approximation of the quantile function, hence the term gradient forest. The quantile order for each split is chosen among a given set of orders.
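The Brier score used by the Bs.gradient rule is a mean-squared error between predicted probabilities and 0/1 outcomes. The sketch below shows that calculation only; it leaves out the censoring adjustment and the time-horizon choice that a survival implementation needs, and the function name `brier_score` and the numbers are hypothetical.

```python
def brier_score(predicted, observed):
    """Mean-squared error between predicted probabilities and 0/1 outcomes."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("predicted and observed must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)

# Illustrative predictions for three cases and whether the event occurred.
predicted = [0.9, 0.2, 0.6]
observed = [1, 0, 1]
score = brier_score(predicted, observed)   # (0.01 + 0.04 + 0.16) / 3
```

A perfect forecast gives a score of 0, while an uninformative forecast of 0.5 for every case gives 0.25, so smaller values indicate better predictions.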
RSF draws $B$ bootstrap samples from the original data. Each bootstrap sample sets aside on average 37% of the data, called OOB data, while the remaining 63% is called the in-bag data. The in-bag data are used to grow the tree and give estimators which are used for prediction. On the other hand, the OOB data are not involved in the growth of the tree but are used for cross-validation purposes. RSF estimates the cumulative hazard function (CHF) and the survival function based on the terminal nodes, using the in-bag and out-of-bag estimators.
Let $h$ denote the terminal node of a tree. Let $t_{1,h} < t_{2,h} < \cdots < t_{N(h),h}$ indicate the distinct event times within node $h$, $d_{l,h}$ indicate the number of deaths at time $t_{l,h}$ and $Y_{l,h}$ indicate the number of individuals at risk at time $t_{l,h}$. The CHF for node $h$ is approximated using the bootstrapped Nelson–Aalen estimator:
$$\hat{H}_h(t) = \sum_{t_{l,h} \le t} \frac{d_{l,h}}{Y_{l,h}}.$$
This implies that, for a given tree, the hazard estimate for node $h$ is the ratio of events to individuals at risk, summed across all unique event times. Each terminal node of a tree provides a sequence of such estimates and each individual in node $h$ has the same CHF.
The survival function for node $h$ is estimated using the bootstrapped Kaplan–Meier estimator:
$$\hat{S}_h(t) = \prod_{t_{l,h} \le t} \left( 1 - \frac{d_{l,h}}{Y_{l,h}} \right).$$
This gives the estimates for the individuals in node $h$ at a given time $t$. To estimate the CHF and the survival function for a case with a given predictor $x_i$, $x_i$ is dropped down the tree and ends up in a distinct endmost node as a result of the binary nature of the tree. This implies that
$$H(t \mid x_i) = \hat{H}_h(t) \quad \text{and} \quad S(t \mid x_i) = \hat{S}_h(t), \quad \text{if } x_i \in h.$$
This defines the CHF and survival function for all individuals in the data and the estimates for the tree. Due to bootstrapping (sampling with replacement), an observation can be found in various bootstrap samples and hence in various trees.
The in-bag ensemble estimators are computed by averaging the tree estimators. Hence the in-bag ensemble CHF and survival estimators are respectively given as
$$H_e(t \mid x_i) = \frac{1}{B} \sum_{b=1}^{B} H_b(t \mid x_i) \quad \text{and} \quad S_e(t \mid x_i) = \frac{1}{B} \sum_{b=1}^{B} S_b(t \mid x_i),$$
where $H_b$ and $S_b$ are the estimators from the tree grown on the $b$-th of the $B$ bootstrap samples.
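The terminal-node estimators described above can be computed directly from the times and event indicators of the cases that fall in one node. The sketch below is illustrative only: the function name `node_estimates` and the five-case node are hypothetical.

```python
def node_estimates(times, events, t):
    """Return (Nelson-Aalen CHF, Kaplan-Meier survival) at time t for one terminal node."""
    chf, surv = 0.0, 1.0
    # Distinct event times in the node that do not exceed t.
    event_times = sorted({ti for ti, di in zip(times, events) if di == 1 and ti <= t})
    for tl in event_times:
        deaths = sum(1 for ti, di in zip(times, events) if ti == tl and di == 1)
        at_risk = sum(1 for ti in times if ti >= tl)
        chf += deaths / at_risk          # Nelson-Aalen increment
        surv *= 1 - deaths / at_risk     # Kaplan-Meier factor
    return chf, surv

# Hypothetical node with five cases; events == 0 marks a censored case.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
chf_4, surv_4 = node_estimates(times, events, t=4)
```

Every case that lands in this node receives the same pair of estimates, which is what makes the tree prediction for a new case a simple lookup of its terminal node.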
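Let $I_{i,b}$ be an indicator pointing to whether case $i$ is in-bag or out of bag, such that $I_{i,b} = 1$ if case $i$ is OOB for the $b$-th bootstrap sample and $I_{i,b} = 0$ otherwise. To determine the CHF and survival estimators for an OOB case $i$, the case is dropped down the tree to an endmost node $h$. The OOB CHF and survival estimators for $i$ become $H_b^{*}(t \mid x_i)$ and $S_b^{*}(t \mid x_i)$ respectively.
The OOB ensemble estimators are calculated by taking the mean of the OOB tree estimators over the $B$ trees. Hence the OOB ensemble estimators are given as
$$H_e^{**}(t \mid x_i) = \frac{\sum_{b=1}^{B} I_{i,b}\, H_b^{*}(t \mid x_i)}{\sum_{b=1}^{B} I_{i,b}} \quad \text{and} \quad S_e^{**}(t \mid x_i) = \frac{\sum_{b=1}^{B} I_{i,b}\, S_b^{*}(t \mid x_i)}{\sum_{b=1}^{B} I_{i,b}}.$$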
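The OOB ensemble average uses, for each case, only the trees in which that case was out of bag. A minimal sketch of that averaging is given below, assuming the per-tree estimates at a fixed time have already been computed; the function name `oob_ensemble`, the data layout and the numbers are hypothetical.

```python
def oob_ensemble(tree_estimates, oob_indicator):
    """Average per-tree estimates over the trees in which each case is out of bag.

    tree_estimates[b][i]: estimate from tree b for case i at a fixed time.
    oob_indicator[b][i]:  1 if case i is out of bag for tree b, else 0.
    """
    n_cases = len(tree_estimates[0])
    result = []
    for i in range(n_cases):
        total = sum(ind[i] * est[i] for est, ind in zip(tree_estimates, oob_indicator))
        count = sum(ind[i] for ind in oob_indicator)
        # A case that is in-bag for every tree has no OOB estimate.
        result.append(total / count if count > 0 else None)
    return result

# Three hypothetical trees and two cases.
tree_estimates = [[0.2, 0.4], [0.6, 0.8], [1.0, 0.0]]
oob_indicator = [[1, 0], [1, 0], [0, 0]]
ensemble = oob_ensemble(tree_estimates, oob_indicator)
```

The in-bag ensemble is the same calculation with every tree included, that is, a plain mean over all trees.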
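The Cox Aalen’s model specifies the hazard as
$$\lambda(t) = Y(t) \left( X^{T}(t)\, \alpha(t) \right) \exp\left( Z^{T}(t)\, \beta \right),$$
where $Y(t)$ is the indicator of the risk. $(X(t), Z(t))$ is a $(p + q)$-dimensional vector of covariates, where $X(t)$ is the additive nonparametric time-varying covariate and $Z(t)$ are the covariates with constant multiplicative effects. $\alpha(t)$ is a $p$-dimensional vector of time-varying regression coefficients and $\beta$ is a $q$-dimensional vector of relative risk regression coefficients.
Comparison of the prediction accuracy of the different models was done based on the concordance index.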
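The concordance index measures how often the model ranks two cases in the right order. The sketch below assumes Harrell's form of the index, which the text does not specify; the function name `concordance_index` and the toy data are hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell-type concordance: share of usable pairs ranked correctly by risk."""
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when case i has an observed event before the time of case j.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0      # earlier event has the higher predicted risk
                elif risk[i] == risk[j]:
                    concordant += 0.5      # tied predictions count as half
    return concordant / usable if usable else float("nan")

# Toy data: four uncensored cases with risk scores in perfect reverse order of time.
times = [1, 2, 3, 4]
events = [1, 1, 1, 1]
c_perfect = concordance_index(times, events, risk=[4, 3, 2, 1])
c_random = concordance_index(times, events, risk=[1, 1, 1, 1])
```

A value of 1 means every usable pair is ordered correctly and 0.5 corresponds to uninformative predictions, which is the scale on which the reported concordances of the three splitting rules are compared.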
Table 4. Statistical tests (test for PH assumption). The PH assumption is supported by non-significant p-values.
Figure 3. Schoenfeld residuals
The selected variables are V206, V207, B7, and B8 from the log-rank model; V207, V219, and B8 from the Bs.gradient model; and V218, B12, and ML1 from the log-rank score model. To compare the different models and determine the effect of the various splitting methods, the concordance index was used; the results are shown in Table 6.