# Random Forests for Survival, Regression, and Classification

Udaya Kogalur & Hemant Ishwaran

## Table of Contents

## Introduction

Random Forests for Survival, Regression, and Classification

(RF-SRC) is an ensemble tree method for the analysis of data sets

using a variety of models. As is well known, constructing ensembles

from base learners such as trees can significantly improve learning

performance. It was shown by [Breiman,

2001] that ensemble learning can be further improved by

injecting randomization into the base learning process — a method

called Random Forests. RF-SRC extends Breiman’s Random Forests

method and provides a unified treatment of the methodology for

models including right censored survival (single and multiple event

competing risk), multivariate regression or classification, and

mixed outcome (more than one continuous, discrete, and/or

categorical outcome). When one continuous or categorical outcome is

present, the model reduces to univariate regression or

classification respectively. When no outcome is present, the model

implements unsupervised learning. The [Five Models]
figure summarizes the situation.

RF-SRC introduces new split rules for each model and gives the

user the ability to define and code custom split

rules. Deterministic or random splitting is available for all

models. Variable predictiveness can be assessed using variable

importance measures [Ishwaran et

al., 2007] for single as well as grouped variables. Variable

selection is implemented using minimal depth [Ishwaran et al., 2010]. Missing data

(for x-variables and y-outcomes) can be imputed on both training and

test data. The package supports shared memory parallel processing on

desktop computers via OpenMP. The package also supports

hybrid parallel processing via OpenMP & Open MPI on

clusters that implement distributed memory. Execution time is

reduced almost linearly with respect to the cores available.

An overview of this document follows: In the [package overview]
we describe the Random Forests algorithm highlighting the recursivity

that is at its core. The formulas, split rules, terminal node

estimators, and prediction errors used in the five models are

described briefly. In the section on [Variable Selection] the primary approaches for variable

predictiveness, variable selection and identifying interactions

between variables are discussed. In the section on [Imputation]
the approach to handle missingness is discussed. In the section on

[Prediction] a brief discussion of training,

restoration, test data, and pruning is given. In the section

on [Hybrid Parallel Processing]
we demonstrate that the package is capable of hybrid parallel

computation in a non-trivial and massive way, and provide details

of how this can be achieved. In the [Theory and

Specifications] section we provide the mathematics for the split rules,

terminal node estimators, and prediction error for each family.

## Package Overview

Building a Random Forests model involves growing a binary tree [Breiman, 2001] using user supplied training

data and parameters. As is shown in the [Five Models]
figure, data types must be real valued,

discrete or categorical. The response can be right-censored time and

censoring information, or any combination of real, discrete or

categorical information. The response can also be absent entirely. The

resulting forest contains many useful values which can be directly

extracted by the user and parsed using additional functions. The basic

algorithm for growing a forest is provided in this [Recursive Algorithm]. The process iterates over

`ntree`

, the number of trees that are to be grown. In

practice, the iteration is actually parallelized and trees are grown

concurrently, not iteratively. This is explained in more detail in

the section on [Hybrid Parallel Processing]. The

recursive nature of the algorithm is reflected in the repeated calls

to split a node until conditions determine that a node is

terminal. Another key aspect of the algorithm is the injection of

randomization during model creation. Randomization reduces

variation. Bootstrapping [Breiman,

1996] at the root node reduces variation. Feature selection is

also randomized with the use of the parameter `mtry`

. In

the [Recursive Algorithm] `N`

is defined as the

number of records in the data set, and `P`

as the number of

x-variables in the data set. The parameter `mtry`

is such

that ` 1 <= mtry <= P`

. At each node, the

algorithm selects ` mtry `

random x-variables according to

a probability vector ` xvar.wt`

. The resulting subset of

x-variables are examined for best splits according to a

. The parameter

splitrule`nsplit`

also allows one to

specify the number of random split points at which an x-variable is

tested. The depth of trees can be controlled using the parameters

`nodesize`

and `nodedepth`

. The parameter

`nodesize `

ensures that the average nodesize across the

forest will be at least `nodesize`

. The parameter

`nodedepth`

forces the termination of splitting when the depth of a node

reaches the value specified. Node depth is zero-based from the root

node onward. Reasonable models can be formed with the judicious

selection of `mtry`

, `nsplit`

,

`nodesize`

, and `nodedepth `

without exhaustive

and deterministic splitting.

A simulated classification training data set is shown on the left

of the figure labeled [Tree Decision Boundary]. The data set has two real valued

features, $x_1$ and $x_2$.

The response is a class label that can take on four values. The class sizes are equal. Each

class covers a circular area. The circles touch at a single point – the origin. At this point, four data points

with all four class labels exist. On the right of the figure is a tree produced from the training data

with associated split points labeled. The tree was grown using bootstrapping [Breiman, 1996].

Bootstrapping results in well-defined subsets. The bootstrap, some of which are duplicates,

are members of the in-bag (IB) subset. The remaining individuals define the out-of-bag (OOB)

subset for the tree. See [Breiman, 1996] for a detailed description of these concepts. Only the

bootstrap is used to train the model and to define terminal node statistics. In this model,

the terminal node statistic is the resulting frequency distribution of the class labels in a terminal node. This

distribution is shown under each terminal node in the right of the figure. The decision boundary on $x_1$

and $x_2$ formed by the tree is superimposed on the data space along with the six terminal node

labels.

The OOB subset for a tree does not play a role in determining its topology. Each individual in the OOB

subset for a tree is passive and assigned a unique terminal node membership and terminal node

statistic. An OOB ensemble statistic for each individual is formed by combining the terminal

node statistics from all trees in which an individual is an OOB member. The class with the

maximum frequency in the OOB ensemble statistic serves as the predicted class label for the

member. In the figure labeled [Forest Decision Boundary] a

single test data point centered at the origin is sent into the previously grown forest for

prediction. Each tree provides a unique terminal node membership and statistic for this test

individual. The resulting forest ensemble statistic yields a class with the maximum frequency as

the predicted class label for this test data point. We note that the probability of all classes are

roughly equal. This is because the training data set has one case in each class at the origin. The

maximum class frequency for this example is yellow, though we can ascribe this to Monte Carlo

effects. When the x-variable space is densely covered with test data points, the predicted values

for the data space reveal the forest decision boundary. The boundary

is also shown in the lower half of the figure.

### Splitting and Node Size

It is important to understand how splitting and node size affect the topology of a tree. To aid in this,

some notation is introduced first. Assume $h$ is the node of a tree, and that the task is to split $h$ into

two daughter nodes. Assume there are `n`

cases in the node $h$. Some of these may be replicates if

bootstrapping is in effect. At the root node, `n = N`

. Let

there be `P`

x-variables

in the data set. These

can be real valued, discrete, or categorical. Let ${bf X}_i$ be the

`P`

-dimensional

feature for a case $i$ such

that ${bf X}_i = (X_{i1}, ldots, X_{i {tt P}})$. Let $x$ be

the observed value for the $p^{th}$ feature in the data set, where

$p in {1,ldots, {tt P}}$.

#### Real versus categorical splitting

If $x$ is continuous or discrete, a proposed split of $h$ on $x$ is

always of the form $x le c$ and $x > c$. Such a split defines left

and right daughter membership according to an individual’s value

$X_{ip}$. A deterministic split on $x$ requires considering all

unique values of $x$ in the node $h$.

If $x$ is categorical, denoting the split is more complex. For

computational coherence, categorical x-variables (and y-outcomes for that matter)

are treated as factors with an immutable 1:1 mapping of categories

to integers. Thus, an x-variable with $f$ categories (or levels) is

uniquely mapped to $f$ integers ${1, ldots, f}$. A split on a

factor with $f$ levels in the node $h$ requires dividing the levels

into two complementary subsets. Left and right daughter membership is

determined according to an individual’s level $X_{ip}$ and to which of

the two complementary subsets it belongs. A deterministic split on a

factor requires considering all possible complementary pairing of

levels. For example, a factor with six levels in a node has

[ C(6, 1) + C(6, 2) + C(6, 3) ] possible splits. These

possibilities can be considered to be composed of three groups. The

first group has one level going left, and five levels going right.

There are $C(6, 1)$ such splits. The second group has two levels

going left and four levels going right. There are $C(6, 2)$ such

splits. The third group has three levels going left and three levels

going right. There are $C(6, 3)$ such splits. These three groups

comprise the cardinal group count. In general, there are

$floor(f/2)$ cardinal groups. The total number

of possible splits (complementary pairings) is $2^{f-1} – 1 $.

Deterministic splitting on factors becomes problematic because of

the large number of permutations as the number of levels rise. There

is a practical limit on the number of levels a factor can have, within

our implementation. It is imposed by

`UNIT_MAX`

in `limits.h`

on the operating system in question. A

typical value for this is $ 2^{32} – 1 $ on most machines. Overrides are in

place to avoid deterministic splitting on large factors. The nature

of the override is dependent on whether the user has specified

`splitrule = "random"`

and/or `nsplit > 0`

. In general, we

try to limit the number of split attempts to be less than the node

size. By definition, this is always case when considering continuous

x-variables. That is, if a

node has $n’$ bootstrap cases, there will be at most $n’$ values on which to

split. To

encourage good splits, the algorithm attempts to emulate the same action on factors. In some

cases this will result in deterministic splitting. In others, it will

involve sampling from the $2^{f-1} – 1$ complementary pairings.

Whether factor or continuous x-variable, the

sampling is never greater than the nodesize regardless of what

value of `nsplit`

has been set.

Formulas for splits on continuous x-variables are more easily conveyed

to the reader because left and right daughter splits depend

on a simple inequality. For this reason, the split rules described in

the [Theory and Specifications] section on discrete and continuous

x-variables. However, the methodology is exactly applicable to factor

x-variables. In addition, for clarity the formulas described are for

splits at the root node, where $n = N$.

#### Deterministic versus random splitting

In contrast to deterministic splitting, a random version of any split rule

may be invoked using the parameter `nsplit`

. When

`nsplit = 0`

, deterministic splitting occurs and all possible split

points for each of the `mtry`

variables are considered. If `nsplit`

is set to a

non-zero positive integer, then a maximum of `nsplit`

split points are

randomly chosen for each of the `mtry`

variables within a node. The

split rule is applied to the randomly chosen split points, and the node is split

on that x-variable and split value yielding the best split

statistic as measured by the split rule. Pure random splitting can be invoked by setting

`splitrule="random"`

. In this case, for each node, a variable is randomly

selected and the node is split using one randomly chosen split point.

See [Cutler and Zhao, 2001] and [Lin and Jeon, 2006] for more details.

Trees tend to favor splits on continuous variables

[Loh and Shih, 1997], so it is good practice to use the `nsplit`

option when the data contains a mix of continuous and discrete

variables. Using a reasonably small value mitigates bias and may not

compromise accuracy. The value of `nsplit`

has a significant impact on the time

taken to grow a forest. When non-random splitting is in effect,

i.e. `nsplit = 0`

,

iterating over each split point can be

CPU intensive. However, when `nsplit > 0`

, or when pure random

splitting is in effect, CPU times are drastically reduced.

#### Node depth and node size

Node size and node depth are the most explicit means of controlling

the growth of a tree. The parameter `nodedepth`

is defined

as the number of connections between the node and the root node. It

is zero-based, thus defining the root node as having depth zero.

Setting `nodedepth`

limits the maximum depth to which a

tree can be grown. The default behaviour is to ignore this parameter

and not limit the depth of a tree. In contrast, `nodesize`

is always respected and is crucial to achieving a good model. When

`nodesize`

is set too large, the forest will be under-grow

and model accuracy will be compromised. When it is set too small, the

potential to split on noisy x-variables increases, the forest is

over-grown and model accuracy can actually deteriorate past a certain

threshold value for `nodesize`

. We define

`nodesize`

as the average number of bootstrap cases that

will be found in a terminal node in a tree in the forest. Splitting

can terminate well before this condition is reached if node purity is

detected before attempting to split a node. From the [Recursive Algorithm] it can be seen that

there are three conditions necessary for splitting a node:

- The current node depth must be less than the maximum node depth allowed.
- The current node size must be at least 2 times the node size specified.
- The current node must be impure.

Condition (2) assures us that if a split occurs on a node with 2 times

the node size specified, the average node size of the pair of daughters will be

`nodesize`

. If one daughter is less than

`nodesize`

, the other daughter will be greater than

`nodesize`

. But such terminal nodes will average to

`nodesize`

over the forest.

### Model Overview

This section is intended as

an overview to orient the reader to more high-level package details

concerning the [Five Models] produced

by the package. Each model requires a slightly different formula

specification, and uses model-specific split rules. This results in model-specific terminal node statistics,

ensembles, and a model-specific prediction error

algorithm. For completeness and rigour, the

mathematical details of all models are given in the [Theory and Specifications] section. Survival, Competing Risk, Regression,

Classification, Multivariate and Unsupervised split rules are

described, along with the terminal node estimators, and the prediction error for these

families. In addition, a short code example for each model is given.

In the [Formula Specification] table,

example grow calls for the five models are listed. The data sets used in

the two Survival function calls are available in

`randomForestSRC`

. The data sets used in the remaining

function calls are available from the base R installation. Survival

settings require a time and censoring variable which need to be

identified in the formula as the response using the standard

`Surv`

formula specification. Regression and

Classification settings emulate the formula specifications of the

`randomForest`

package. In the Multivariate and

Unsupervised case, there are variants for the formula, depending on

how many y-outcomes are to be specified as the response. These

variants are given in more detail in this section.

Family | Example Grow Call with Formula Specification |
---|---|

Survival | `rfsrc(Surv(time, status) ~ ., data=veteran)` |

Competing Risk | `rfsrc(Surv(time, status) ~ ., data=wihs)` |

Regression | `rfsrc(Ozone ~., data=airquality)` |

Classification | `rfsrc(Species ~., data=iris)` |

Multivariate | `rfsrc(Multivar(mpg, cyl) ~., data=mtcars)` |

Unsupervised | `rfsrc(Unsupervised() ~., data=mtcars)` |

In the [Split Rules] table, the rule in *italics* denotes the default

split rule for each model. The default split rule is

applied when the user does not specify a split rule. The package uses

the data set and formula specification to determine the model.

Survival and Competing Risk both have two split

rules. Regression has three flavours of split rules based on mean-squared

error. Classification also has three flavours of split rules based on the

Gini index. The Multivariate and Unsupervised split rules are a composite rule

based on Regression and Classification. Each component of the composite is normalized

so that the magnitude of any one y-outcome does not influence the

statistic. See the [Theory and Specifications] section for more details on the

mathematics of the split for each family.

Family | Split Rules |
---|---|

Survival | log-rank eqref{eqn:survival.logrank} |

log-rank score eqref{eqn:survival.logrank.score} | |

Competing Risk | log-rank modified weighted eqref{eqn:cr.modified.logrank} |

log-rank eqref{eqn:cr.logrank} | |

Regression | mean-squared error weighted eqref{eqn:regression.weighted} |

mean-squared error unweighted eqref{eqn:regression.unweighted} | |

mean-squared error heavy weighted eqref{eqn:regression.heavy.weighted} | |

Classification | Gini index weighted eqref{eqn:classification.weighted} |

Gini index unweighted eqref{eqn:classification.unweighted} | |

Gini index heavy weighted eqref{eqn:classification.heavy.weighted} | |

Multivariate | Composite mean-squared error |

Composite Gini index | |

Unsupervised | Composite mean-squared error |

Composite Gini index |

All models also allow the user to define a custom split rule

statistic. Some basic C-programming skills are required. Examples

for all the families reside in the C source code directory of

the package in the file `splitCustom.c`

. Note that

recompiling and re-installing the package is necessary after modifying

the source code.

In the table [Terminal Node

Statistics] (TNS),

the terminal node estimators produced by the five models are summarized. For Survival, the TNS is

the Kaplan-Meier estimator at the time points of interest specified by

the user, or as determined by the package if not specified.

Competing Risk has two TNS’s:

the cause-specific cumulative hazard estimate, and the cause-specific

cumulative incidence function. Regression and Classification TNS’s are

the mean and class proportions respectively. For a Multivariate model,

there are TNS’s for each response, whether it is real valued, discrete or

categorical. The Unsupervised model has no TNS, as the analysis is

responseless. See the [Theory and Specifications]
section for more details on the

mathematics of the TNS for each family.

Family | Terminal Node Statistics |
---|---|

Survival | Kaplan-Meier eqref{eqn:kaplan.meier} |

Competing Risk | cause-specific cumulative hazard eqref{eqn:cause.specific.chf} |

cause-specific incidence eqref{eqn:cause.specific.cif} | |

Regression | mean eqref{eqn:mean} |

Classification | class proportions eqref{eqn:class.proportions} |

Multivariate | mean per response eqref{eqn:mean} |

class proportions per response eqref{eqn:class.proportions} | |

Unsupervised | none |

In the table [Prediction Error],

the error rate calculation for the five

models is summarized. For Survival, it is based on Harrell’s concordance-index using the

cumulative hazard estimate as the values for comparison. For

Competing Risk, Harrell’s concordance-index uses the cause-specific

cumulative-hazard estimate. For Regression, performance is based on

mean-squared error. For classification, performance is based on the

conditional and over-all misclassification rate. For the Multivariate

and Unsupervised case, there is no prediction error implemented.

See the [Theory and Specifications]
section for more details on the

mathematics of the prediction error for each model.

Family | Prediction Error |
---|---|

Survival | Harrell’s C-index using cum-haz eqref{eqn:survival.mortality} and eqref{eqn:concordance} |

Competing Risk | Harrell’s C-index using cause-specific cum-haz eqref{eqn:cause.specific.mortality} and eqref{eqn:concordance} |

Regression | mean-squared error eqref{eqn:mean.squared.error} |

Classification | conditional and over-all misclassification rate eqref{eqn:misclassification.rate} and eqref{eqn:conditional.misclassification.rate} |

Multivariate | none |

Unsupervised | none |

We conclude this section by giving examples of a mixed multivariate

model, and an unsupervised model. The data set

`mtcars`

, available in the base distribution of R is

used. The usual outcome for the data set is `mpg`

(miles per

gallon). First, we modify the data set to force `carb`

(number

of carburetors) and `cyl`

(number of cylinders) to be categorical

values. We produce a mixed model where all three of these variable

are considered as outcomes. The code is given below:

```
# Multivariate mixed outcome: Motor Trend Car Road Tests. Miles per
# gallon is the usual response, but the data set is modified to
# introduce categorical and ordinal variables.
mtcars.mod
```

We also show various ways to implement an unsupervised model in the code

below. It is possible to choose one, or more pseudo-responses at each

split via the formula specification.

```
# Motor Trend Car Road Tests. Miles per gallon is the usual response,
# but it is ignored below and the data set is taken to be responseless.
mtcars.unspv
```

## Variable Selection

The package has two primary approaches for variable selection and

identifying interactions between variables. The function

`max.subtree()`

extracts maximal subtree information from a

forest [Ishwaran et al., 2010],

[Ishwaran et al., 2011]. The

function `vimp()`

extracts variable importance information from a

forest [Ishwaran et

al., 2007].

### Minimal Depth and Maximal Subtrees

Minimal depth is a dimensionless order statistic that measures the

predictiveness of a variable in a tree. It can be used to select

variables in high-dimensional problems. It assesses the predictiveness of a variable

by a depth calculation relative to the root node of a tree.

A maximal subtree for an x-variable $x$ is the largest subtree

whose root node splits on $x$ (i.e. all the parent nodes of

$x’s$ maximal subtree have nodes that split on variables other

than $x$). There can be more than one maximal subtree for a

variable. A maximal subtree will not exist if there are no splits

on the variable. The shortest

distance from the root of the tree to the root of the closest

maximal subtree of $x$ is the minimal depth of $x$. A smaller

value corresponds to a more predictive variable. The mean of the minimal depth distribution is used as the

threshold value for deciding whether an x-variable’s minimal depth value

is small enough for the x-variable to be classified as strong. An

important fact of minimal depth is that it does not depend on a

comparison of error rates, as does variable importance (described

later), but is only dependent on the topology of the forest.

See [Ishwaran et al., 2010] and

[Ishwaran et al., 2011] for more details.

The maximal subtrees for $x_1$ and the tree shown in the figure

labeled [Tree Decision Boundary] are

highlighted in blue in the figure labeled [Maximal Subtrees for $x_1$]. There are exactly two of them.

The minimal depth for $x_1$ is equal to one.

Examples of minimal depth and maximal subtrees follow:

```
# Survival Analysis with first and second order depths for all variables.
data(veteran, package = "randomForestSRC")
v.obj
```

### Variable Importance

The function `vimp()`

extracts variable importance (VIMP) information

[Ishwaran et

al., 2007] from a forest for a single x-variable

or group of x-variables. By default, VIMP is calculated for the

training data, but the user can

specify new test data for the VIMP calculation using the

parameter `newdata`

.

Action on the training data is as follows: The parameter

`importance`

allows four distinct ways to calculate VIMP. The

default value of `importance="permute"`

returns the Breiman-Cutler permutation VIMP as

described in [Breiman, 2001]. For each tree, the prediction error on

the OOB data is recorded. Then for a given x-variable

$x$, OOB cases are randomly permuted in $x$ and the

prediction error is recorded. The VIMP for $x$ for a tree is defined as the

difference between the perturbed and unperturbed error rate for that

tree. This is then averaged

over all trees. If `importance="random"`

is used, then OOB

cases are assigned a daughter node randomly

whenever a split on $x$ is encountered in the tree. Again, the VIMP for $x$ is defined as the

difference between the perturbed and unperturbed error rate averaged

over all trees. If the options `permute.ensemble`

or

`random.ensemble`

are used, then VIMP is calculated by comparing

the error rate for the perturbed OOB forest ensemble to the

unperturbed OOB forest ensemble. Unlike the Breiman-Cutler

measure, these last two VIMP ensembles do not measure the tree average

effect of perturbing

$x$, but rather the overall forest effect. See [Ishwaran et

al., 2008] for further details.

Joint VIMP is requested using `joint=TRUE`

. The joint VIMP is the

importance for a group of variables when the group is perturbed

simultaneously.

VIMP is completely dependent on prediction error. Although tempting, it is incorrect to

assume VIMP measures the change in prediction error for a forest grown with and without

a variable. For example, if two variables are highly correlated and both predictive,

each can have large VIMP values. Removing one variable and regrowing

the forest may affect the VIMP for the other variable (its value might get larger),

but prediction error will likely remain unchanged. VIMP measures the change in

prediction error on a fresh test case if $x$ were not available, given that the original

forest was grown using $x$. In practice, this often equals change in prediction

error for a forest grown with and without $x$, but conceptually the two quantities

are different. Examples follow:

```
# Classification example showcasing different vimp measures.
iris.obj
```

## Imputation

To handle data sets with missing data, the package provides

two interfaces to impute missing values. If a predictive model is desired,

then the user can create a forest with a call to `rfsrc()`

. If all

that is desired is a data set containing imputed values, without the

need for a model, then a call to `impute()`

will suffice. If no

formula is specified, unsupervised splitting is

implemented. The number of impute iterations is specified via the parameter `nimpute`

.

The number of impute iterations

corresponds to the number of forests grown.

When a predictive model is desired, only the final forest is retained.

Note that two impute iterations were found to be the minimum necessary to produce

reasonable imputed values.

On the first impute iteration, imputation is conducted in each internal

node as the tree is grown. This in-node imputation is used to determine left

and right daughter membership during the grow process. Note that

imputed values are never used in split statistics on the first impute

iteration. On the second, and subsequent

impute iterations, in-node imputation is absent. Instead, the forest

is grow as if there was no missing data, with missingness holes in the data

plugged by the imputed values determined by the previous impute

iteration. Imputation on the second and subsequent iteration only

occurs in the terminal nodes, after the forest has been grown.

The default setting for `rfsrc()`

is

`nimpute = 1`

. The default setting for

`impute()`

is `nimpute = 2`

.

Once a model has been created using `rfsrc()`

, test data with or

without missingness can be used with `predict.rfsrc()`

. Values in

the test data set are imputed based on the non-missing values in the training data

set. Internally,

both `rfsrc()`

and `impute.rfsrc()`

use variations of

the missing data algorithm in [Ishwaran et

al., 2008]. Setting

`na.action = "na.impute"`

imputes missing data (both x-variables and y-outcomes),

while `na.action = "na.omit"`

eliminates records with

missingness and then grows the forest. Examples follow:

```
# Examples of survival imputation.
# Create a model via imputation.
data(pbc, package = "randomForestSRC")
pbc.d
```

## Prediction

Predicted values are obtained by dropping test data down the forest

after the model has been created. The overall error rate is

returned if the test data contains a y-outcome.

If no test data is provided, then the original training data is used

and the code reverts to restoring the forest, associated terminal

node statistics and ensembles. This is useful in that it gives the user the

ability to extract outputs from the forest that were not asked for

during model creation. The figure [Forest

Decision Boundary] is an example of how prediction can be used to view the forest decision boundary imposed on

the x-variables. Examples of traning, testing and restoration follow:

Training/testing scenario:

```
data(veteran, package = "randomForestSRC")
# Sample the data and create a training subset.
train
```

Restoration scenario:

```
# Train the model.
airq.obj
```

### outcome = « test »

If `outcome = "test"`

, the training data is not used to calculate the

terminal nodes statistics. Instead, the TNS are recalculated

using the y-outcomes from the test set. This yields a modified

predictor in which the invariant topology of the forest is based on the

training data, but where the ensembles and predicted

values are based on the test data. The

terminal node statistics,

ensembles and error rates are all calculated by bootstrapping the test

data and using OOB individuals to ensure unbiased estimates.

Consider the the forest depicted previously in [Forest

Decision Boundary] and the associated forest ensemble for the single test

point at the origin. This original training data and ensemble are shown

in the top left and top right of [outcome = « test »]. New test data, formed by reducing

the radii of the blue and yellow classes is shown in the

bottom left of the figure. The origin

contains only two data points, namely one red and one green point.

Using the new test data to populate terminal nodes, results in an ensemble that is missing the blue and yellow classes.

This ensemble is shown in the bottom right of the figure.

The function for creating the spheres is

given in the section [Supplementary Code], and the higher level code used to produce the [outcome = « test »] figure follows:

```
## Four circular regions with equal class counts, touching at the origin.
## We suppress the X3 dimension in order to get a circle from a sphere.
circles
```

### Pruning

There are a number of ways to limit the growth of trees in a model.

The most obvious are the parameters `nodedepth`

and

`nodesize`

discussed in the section [Node Depth and Node Size.

Decreasing `nodedepth`

or increasing

`nodesize`

will result in shallower trees. After a forest has been

created, the user is

able to prune trees back using the

parameter `ptn.count`

. This parameter represents the

pseudo-terminal node count. It allows us to specify the desired number of

pseudo-terminal nodes after a tree has been grown. This is not

useful for Random Forests analysis per se, but it can be useful in

other applications. Trees are flexible and adaptive nonparametric

estimators and, as such, they represent ideal weak learners for

implementing gradient boosting

[Friedman, 2001]. Boosted

trees for regression and classification, where exactly J-terminal nodes

(for any integer value of J) are needed, is easily accomplished using

the parameter `ptn.count`

. This particular application of the

`randomForestSRC`

package is incorporated into the

`boostmtree`

package on CRAN

[Ishwaran and Kogalur, 2016]. Pruning is accomplished by deleting

terminal nodes from the maximum depth back toward the root until the desired

pseudo-terminal node count is achieved. Daughter nodes are deleted in

pairs. This results in a parent node becoming a pseudo-terminal node,

after the daughter terminal nodes have been deleted at the current

maximum depth.

## Hybrid Parallel Processing

This package has the capabilities of non-trivial parallel

processing with massive scalability. Growing a forest has a recursive

component as was shown in the [Recursive Algorithm]. But it is also an

iterative process that is repeated `ntree`

times. This fact is

depicted in the figure [Iterative Process]. The entry point is a

user-defined R script that initiates the grow call

`rfsrc()`

. This function pre-processes the data set, and calls

a grow function in the package’s C library, which grows the

forest. The C function grows a single tree, adds the resulting

tree-specific ensemble(s) to the forest ensemble(s), and continues this

process for `ntree`

iterations. The C function then

returns the forest ensemble(s) to the R function, which

post-processes the results. It is advantageous to parallelize this

iterative process. Such situations are often called « pleasingly

parallel ». The default pre-compiled binaries, available on CRAN,

should execute in parallel. If not, the package has been written to be

easily compiled to accommodate

parallel execution. As a prerequisite, the host architecture,

operating system, and installed compilers must support it. OpenMP and

Open MPI compilers must be available on the host

system. If this

is the case, it is possible to compile the source code to enable parallel processing.

A graphic

showing one mode of parallel execution (OpenMP) on a typical desktop/laptop

computer is shown in the figure [Shared

Memory Parallel Processing]. In this example, the

hardware on the left of the figure is a Macbook Pro (2015 model) with two quad-core CPUs,

and 16GB of RAM. This hardware

allows eight threads to execute simultaneously. The memory is

shared among all cores, and this configuration is typical of symmetric

multiprocessing (SMP) where two or more identical processors connect

to a shared memory bank. Most reasonably configured multi-core

notebooks and desktops are ideal hosts for `randomForestSRC`

and OpenMP parallel

execution. The right side of [Shared

Memory Parallel Processing] shows the software

implementation of the OpenMP model. The R code reads the

data set, pre-processes the data set, calls the

C code and requests a forest of

`ntree`

trees. When OpenMP parallel execution is in force, the

C code grows trees simultaneously, rather than iteratively.

Each core is tasked with growing a tree. Once a core completes growing a tree, the

forest ensemble values are incremented with the contribution of that tree,

and the core is re-tasked with growing another tree. This continues

until the entire forest is grown. Control is then returned to the

R code which does post-processing of the results. On the hardware

shown in the figure, it is possible to

grow eight trees at a time. Under ideal conditions,

the elapsed time to grow a forest should decrease linearly and be

close to a factor of eight.

In order to give the user some flexibility, core usage can be

controlled via R options. First, the user needs to determine the number of cores on

the machine. This can be done within an R session by

loading the `parallel`

package and issuing the command

`detectCores()`

. It is possible to set the numbers of cores

accessed during OpenMP parallel execution at

the start of every R session by issuing the command

`options(rf.cores = x)`

, where `x`

is the number of

cores. If `x`

is a negative number, the package will access

the maximum number of cores on the machine. The `options`

command can

also be placed in the users `.Rprofile`

file for convenience.

Alternatively, one can initialize the environment variable `RF_CORES`

in the user’s command shell environment.

Gains in performance on a desktop (or a single SMP node) are limited by the number of cores

available on the hardware. The OpenMP parallel model shown in [Shared

Memory Parallel Processing] that relies on an underlying

iterative process

was chosen for its suitability in

accommodating shared memory systems and its ease of implementation. Model

creation is computationally intense and the OpenMP parallel processing

model allows us to take full advantage of the hardware available.

Message Passing Interface (MPI), an alternative model

to OpenMP is suitable for distributed memory systems, in which the

number of cores available can be much greater than that of a desktop

computer. It is possible to run

`randomForestSRC`

on appropriately designed and configured

clusters, allowing for massive scalability. An ideal

cluster for the package is made up of tightly connected SMP nodes.

Each SMP node has its own memory, and does not share

this memory with other nodes. Nodes, however, do communicate with

each other using MPI.

In the figure [MPI/OpenMP Parallel Processing], a generic cluster with four SMP nodes is

depicted. Each node can be considered to be similar to a desktop

computer. However, in a cluster setting, each node is tightly

supervised and controlled by a top-level layer of cluster software

that handles job scheduling, and resource management. On the left of the

figure, four nodes are connected via a network that facilitates MPI

communications across nodes. The right of the figure depicts the software

implementation of hybrid parallel processing using the

`randomForestSRC`

package. A primary/replica framework is chosen whereby a primary node on

the cluster is tasked with growing the forest. The primary node

divides the forest into sub-forests. If the size of the forest is

`ntree`

, then each replica in this example is tasked with growing a

sub-forest of size `ntree/4`

. Hybrid parallel processing is

dependent on the `Rmpi`

package. This package must be available.

If it is not, it must compiled and installed from the source code available on CRAN.

The `Rmpi`

is a mature package and provides flexible low level

functionality. Other explicitly parallel general-use packages on CRAN such as `snow`

,

`snowfall`

, and `foreach`

provide higher level user friendly

functions that, while potentially useful, were considered not as

flexible and customizable in a cluster environment.

In the hybrid example

that follows, `Rmpi`

is loaded on

the primary node, and MPI processes are then spawned on each

replica. Disregarding the fact that the replicas can communicate with the

primary using MPI, each replica acts like an SMP node in [Shared

Memory Parallel Processing]. A replica initiates concurrent OpenMP threads across it

cores, and grows its sub-forest using shared memory OpenMP parallel

processing. It is possible to grow many more trees

concurrently on a cluster, because the forest is spread among many more

cores than are available on a single node, or desktop computer. In

[MPI/OpenMP Parallel Processing],

the forest uses 32 cores concurrently. In an ideal

environment, the elapsed time to grow the original forest will be

reduced by a factor of 32. After all replicas have completed growing their

sub-forests, they return their data to the primary process, via MPI messages,

and the replicas terminate. On a single SMP node (equivalent to a desktop computer), a thread on a core

starts, grows a tree, terminates, and repeats this process until all trees

have been grown. A core is either waiting for a task,

working on a task, or has completed a task. A task, in this case, is

the growing of a tree. On a cluster, a replica node starts, grows a

sub-forest, and terminates. It repeats this process until all

sub-forests requested by the primary node have been grown. A replica is either waiting for a

sub-forest, working on a sub-forest, or has completed a sub-forest.

As proof of concept, and for benchmarking purposes we ran the package

on two clusters available to us. The first cluster was the Learner

Research Institute High Performance Cluster (LRI-HPC) at the Cleveland

Clinic. This system runs Simple Linux Utility for Resource Management

(SLURM) supervisor software. It is a Linux-based cluster consisting

of twenty nodes with ten cores per node. Each node has 64GB of shared

memory. The second cluster was the

[Pegasus 2] installation at the High

Performance Computing Group at the Center for Computational Science

(HPC CCS) at the University of Miami. This system has 10,000 cores

spread across over 600 nodes. The system also runs a flavour of Linux

but uses Platform LSF as the job scheduler. The different job schedulers

for the two clusters required minor customization of the calling

shell scripts. For our benchmarks, we analyzed a simulated survival data

set with `n=3000`

, `p=30`

, and `ntree=1024`

, with the

default value of `importance="permute"`

. We ran the package in

serial mode and scaled up to 1024 cores across 64 nodes. The results

are shown in this figure. In serial mode, growing 1024

trees iteratively (using one core) took about 330 minutes. The red

points all reflect OpenMP parallel processing, on one node. These

results would be similar to that of a desktop computer. Pegasus 2 has

16 cores per node. When using all cores on one node were used,

elapsed time decreased to about 25 minutes. This reflects a decrease

in a factor of 13 from the serial time. The theoretical maximum

decrease would be closer to a factor of 16. The results would be

similar to that of running the analysis on desktop computer with 16 cores. Next,

hybrid parallel processing using the primary/replica framework was

implemented. Two replica nodes were tasked, then three, and so on up to

64 nodes. At 64 nodes, total core usage was 1024. At this number,

the cluster was growing all 1024 trees simultaneously. The

data points reflecting hybrid parallel processing are in blue. Linear

decreases in elapsed time were observed. This indicates that the

package is capable of massive scalability. The primary and replica

code harnesses are given in the section [Supplementary Code].

## Theory and Specifications

### Survival

There are two split rules that can be used to grow survival trees.

In this model, the response or y-outcome associated with individual (i) is a pair of

values specifying a non-negative survival time and censoring

information. Denote the response for (i) by (Y_i = (T_i,

delta_i)). The individual is said to be right censored at time (T_i)

if (delta_i = 0) and is said to have died at time (T_i) if (delta_i

= 1). In the case of death, (T_i) will be referred to as an event

time, and the death as an event. An individual (i) who is right

censored at (T_i) simply means that the individual is known to have been

alive at (T_i), but the exact time of death is unknown.

#### Log-rank splitting

The log-rank test for splitting survival trees is a well established

concept [Segal, 1988]. It has been shown to be robust

in both proportional and non-proportional hazard settings

[Leblanc and Crowley, 1993]
Note that

a proposed split of (h) on a real value (x) is of the form (x le c) and (x >

c). Such a split defines left and right daughter membership. The

split that maximizes survival differences between the two daughter

nodes is the best split. Let (t_1 c},

end{equation*}

where (x_i) is the value of (x) for individual (i=1, ldots, n).

Define (Y_k=Y_{k,l}+Y_{k,r}) and (d_k=d_{k,l} + d_{k,r}). Let

(n_s) be the total number of observations in daughter (s), Thus,

(n=n_l+n_r), where (n_l=#{i: x_ile c}) and

(n_r=#{i: x_i> c}).

The log-rank test for a split at the value (c) for an x-variable (x)

is

begin{equation}

L(x,c)=frac{{displaystylesum_{k=1}^mleft(d_{k,l}-Y_{k,l}frac{d_k}{Y_k}right)}}

{sqrt{{displaystylesum_{k=1}^mfrac{Y_{k,l}}{Y_k}left(1-frac{Y_{k,l}}{Y_k}right)

left(frac{Y_k-d_k}{Y_k-1}right)d_k}}}.

label{eqn:survival.logrank}

end{equation}

The value (|L(x,c)|) is a measure of node separation. Larger

values of (|L(x,c)|) imply a greater the difference between the two

groups, and the better the split. In particular, the best split at

node (h) is determined by finding the x-variable (x^*) and split value

(c^*) such that (|L(x^*,c^*)|ge |L(x,c)|) for all (x) and (c).

#### Log-rank score splitting

The log-rank score test for splitting survival trees is described in

[Hothorn and Lausen (2003)]. In this rule, assume the x-variable

(x) has been ordered so that (x_1 le x_2 le cdots le x_n). Now,

compute the « ranks » for each survival time (T_j) where (j in {1,

ldots, n}). Thus,

begin{equation*}

a_j=delta_j-sum_{k=1}^{Gamma_j}frac{delta_k}{n-Gamma_k+1} ,

end{equation*}

where (Gamma_k=#{t:T_tle T_k}). Note that (Gamma_k) is the

index of the order for (T_k). The log-rank score test is defined as

begin{equation}

S(x,c)=frac{sum_{x_kle c}left(a_j-n_loverline{a}right)}

{sqrt{{displaystyle n_lleft(1-frac{n_l}{n}right)s_a^2}}} ,

label{eqn:survival.logrank.score}

end{equation}

where (overline{a}) and (s_a^2) are the sample mean and sample variance of

({a_j: j=1, ldots, n}). Log-rank score splitting defines the measure

of node separation by (|S(x,c)|). Maximizing this value over (x)

and (c) yields the best split.

#### Terminal node estimators

The survival estimate associated with a terminal node is provided by

the Kaplan-Meier (KM) estimator [Kaplan and Meier (1958)].

Again let (t_1

An example of a survival forest follows:

```
# Veteran's Administration Lung Cancer Trial. Randomized
# trial of two treatment regimens for lung cancer.
data(veteran, package = "randomForestSRC")
v.obj
```

#### Prediction error

Prediction error for survival models is measured by (1-C), where

(C) is Harrell’s [] concordance

index. Prediction error is between 0 and 1, and measures how well the

predictor correctly ranks two random individuals in terms of survival.

Unlike other measures of survival performance,

Harrell’s C-index does not depend on choosing a fixed time for

evaluation of the model and specifically takes into account censoring

of individuals [May et al., (2004)]
The method has quickly become quite popular in the

literature as a means for assessing prediction performance in survival

analysis settings. See [Kattan et

al., 1988] and references therein.

To compute the concordance index, it is necessary to define what constitutes a

worse predicted outcome. Let

(t_1^*,ldots,t_M^*) denote all unique times in the data.

Define the predicted outcome for individual (i) to be

begin{equation}

mathcal{M}_i = sum_{k=1}^{M}hat{H}_e^*(t_k^*|{bf X}_i).

label{eqn:survival.mortality}

end{equation}

Individual (i) is said to have a worse predicted outcome than

individual (j) if (mathcal{M}_i > mathcal{M}_j)

The concordance error rate is computed as follows:

- Form all possible pairs of observations over all the data.
- Omit those pairs where the shorter event time is censored.

Also, omit pairs ((i, j)) if (T_i=T_j) unless ((delta_i=1, delta_j=0)) or

((delta_i=0, delta_j=1)) or ((delta_i=1, delta_j=1)). The last restriction

only allows ties in event times if at least one of the observations is a death.

Let the resulting pairs be denoted by (mathbb{S}). Let (permissible =

|mathbb{S}|). - If (T_i ne T_j), count 1 for each (s in mathbb{S}) in which the shorter time

had the worse predicted outcome. - If (T_i ne T_j), count 0.5 for each (s in mathbb{S}) in which

(mathcal{M}_i = mathcal{M}_j). - If (T_i = T_j), count 1 for each (s in mathbb{S}) in which

(mathcal{M}_i = mathcal{M}_j). - If (T_i = T_j), count 0.5 for each (s in mathbb{S})

in which (mathcal{M}_i ne mathcal{M}_j). - Let (concordance) denote the resulting count over all

permissible pairs. Define the concordance index (C) as

begin{equation}

C=dfrac{concordance}{permissible}.

label{eqn:concordance}

end{equation} - The error rate is (Err=1-C). Note that (0 le Err le 1) and that

(Err=0.5) corresponds to a procedure doing no better than random

guessing, whereas (Err=0) indicates perfect prediction.

### Competing Risk

Competing risks occur in survival analyses when the occurrence of one

event precludes the occurrence of the remaining set of possible

events. Individuals subject to competing risks are observed from study

entry to the occurrence of the event of interest. This is a competing event

though often, before the individual can experience one of the events, that

person is right censored.

Formally, let (T_i^o) be the event time for

the (i^{th}) subject, (i=1,ldots,N), and let (delta_i^o) be the event type,

(delta_i^oin{1,ldots,J}), where (J ge 1). Let (C_i^o) denote the

censoring time for individual (i) such that the actual time of event

(T_i^o) is unobserved and one only observes (T_i=min(T_i^o,C_i^o))

and the event indicator (delta_i=delta_i^o I(T_i^ole C_i^o)). When

(delta_i=0), the individual is said to be censored at (T_i), otherwise if

(delta_i=j>0), the individual is said to have an event of type (j) at

time (T_i). Denote the observed response for (i) as (Y_i = (T_i,delta_i)).

The cause-specific hazard function (cs-H) for event (j) given

a (P)-dimensional feature ({bf X}_i) is

begin{equation}

alpha_j(t|{bf X}_i) = lim_{Delta trightarrow 0} frac{mathbb{P}{tle T^o le t+Delta t,

delta^o=j | T^oge t,{bf X}_i}}{Delta t} = frac{f_j(t|{bf X}_i)}{S(t|{bf X}_i)} .

label{eqn:csh}

end{equation}

Here (S(t|{bf X}_i)=mathbb{P}{T^oge t|{bf X}_i}) is the event-free survival

probability function given ({bf X}_i). The cs-H

describes the instantaneous risk of event (j) for subjects that

currently are event-free. The probability of an event is

determined using the cause-specific cumulative incidence function (cs-CIF), defined as the

probability of experiencing an event of type (j) by time (t);

i.e. (F_j(t|{bf X}_i) = mathbb{P}{T^o le t, delta^o=j|{bf X}_i}). The cs-CIF and cs-H

are related according to

begin{equation}

F_j(t|{bf X}_i)

=int_0^t S(s-|{bf X}_i)alpha_j(s|{bf X}_i), mathrm d s

= int_0^t expleft(-int_0^s sum_{l=1}^J alpha_l(u|{bf X}_i)

mathrm d uright) alpha_j(s|{bf X}_i), mathrm d s .

label{eqn:cif}

end{equation}

See [Gray (1988)] for an analysis of the cs-CIF.

With these two equations, it is possible to describe the two split

rules that can be used to grow competing risk trees.

Again, for notational convenience these rules are described for the root node

using the entire learning data, and for splits on real valued

x-variables. However, the algorithm

extends to any tree node, to bootstrap data, and to splits

on categorical x-variables.

As before, let (T_i,delta_i)_{1le i le n}) denote the pairs of survival times

and event indicators. Let (t_1

two daughter nodes containing two new sets of competing risk data. The

subscripts (l) and (r) refer to the left and

right daughter node, respectively, and denote (alpha_{jl}

(alpha_{jr}

eqref{eqn:csh}, in the left and

the right daughter node. Similarly define (F_{jl}

to be the cs-CIF, given by eqref{eqn:cif}, for the left and the right daughter node.

The number of individuals at risk at time (t) in the left and right

daughter nodes are (Y_l

begin{equation*}

Y_l

Y_r

end{equation*}

and (x_i) is the value the x-variable assumes for individual (i=1,ldots,n).

The number of type (j) events at time (t) for

the left and right daughters is

begin{equation*}

d_{j,l}

d_{j,r}

end{equation*}

The total number of individuals who are risk at time (t), and

the total number of type (j) events at time (t) in the parent node are

given by

begin{eqnarray}

Y

label{eqn:event.count} \

d_j

label{eqn:at.risk}

end{eqnarray}

Define also (t_m,t_{m_l},t_{m_r}) to be the largest time

on study in the parent node and the two daughters, respectively.

#### Log-rank splitting

The first competing risk split rule is the log-rank test. This is a test of the null hypothesis

(H_0:alpha_{jl}

The test is based on the weighted difference of the cause-specific

Nelson-Aalen estimates in the two daughter nodes. Specifically, for a

split at the value (c) for variable (x), the split rule is

begin{equation*}

L^{mathrm{LR}}_j(x,c)=frac 1{hatsigma^{mathrm{LR}}_{j}(x,c)}sum_{k=1}^m

W_j(t_k)left(d_{j,l}(t_k)

-frac{d_j(t_k)Y_l(t_k)}{Y(t_k)}right) ,

end{equation*}

where the variance estimate is given by

begin{equation*}

(hatsigma^{mathrm{LR}}_{j}(x,c))^{2}= sum_{k=1}^m W_j(t_k)^2 d_j(t_k)

frac{Y_l(t_k)}{Y(t_k)} left(1-frac{Y_l(t_k)}{Y(t_k)}right)

left(frac{Y(t_k)-d_j(t_k)}{Y(t_k)-1}right) .

end{equation*}

Time-dependent weights (W_j

sensitive to early or late differences between the cause-specific

hazards. The choice (W_j

test which has optimal power for detecting alternatives where the

cause-specific hazards are proportional. In practice, (W_j

W_j). That is, the weights are constant and not time-dependent.

Furthermore, the cause-specific split rules across the

event types are combined to arrive at the composite log-rank competing split rule

that is implemented in the package and given by

begin{equation}

L^{mathrm{LR}}(x,c) = frac{sum_{j=1}^J

(hatsigma^{mathrm{LR}}_j(x,c))^2

L_j^{mathrm{LR}}(x,c)}{sqrt{sum_{j=1}^J(hatsigma^{mathrm{LR}}_j(x,c))^2}} .

label{eqn:cr.logrank}

end{equation}

The best split is found by

maximizing (|L^{mathrm{LR}}(x,c)|) over (x) and (c).

#### Modified log-rank splitting

The cause-(j) specific log-rank split rule eqref{eqn:cr.logrank} is

useful if the main purpose is to detect variables that affect the

cause-(j) specific hazard. It may not be optimal if the purpose is

also prediction of cumulative event probabilities. In this case better

results may be obtained with split rules that select variables

based on their direct effect on the cumulative incidence. For this

reason the second split rule is modeled after Gray’s

test [Gray (1988)], which tests the null hypothesis

(H_0:F_{jl}

See [Ishwaran et al. (2014)] for additional details in using the

modified risk set defined by

begin{equation*}

Y^*_j

end{equation*}

This equals the number of individuals who have not had an event prior

to (t) in addition to those individuals who have experienced an event

(j^{‘} ne j) prior to (t), but who are not censored.

The split rule based

on the score statistic which uses the modified risk sets, is denoted

(L^{mathrm{G}}_j(x,c)) and given by substituting (Y^*_j) for (Y_j)

and (Y^*_{jl}) for (Y_l) in eqref{eqn:cr.logrank}. Note that if the

censoring time is not known for those cases that have an event before

the end of follow-up, the largest observed time is used, and the

statistic (L^{mathrm{G}}_j(x,c)) is still a good (and computationally

efficient) approximation of Gray’s test statistic, see

[Section 3.2, Fine, Jason P and Gray,

Robert J (1999)].

Thus the composite modified log-rank competing split rule is given by

begin{equation}

L^{mathrm{G}}(x,c) = frac{sum_{j=1}^J

(hatsigma^{mathrm{G}}_j(x,c))^2

L_j^{mathrm{G}}(x,c)}{sqrt{sum_{j=1}^J(hatsigma^{mathrm{G}}_j(x,c))^2}} .

label{eqn:cr.modified.logrank}

end{equation}

The best split is found by

maximizing (|L^{mathrm{G}}(x,c)|) over (x) and (c).

#### Terminal node estimators

Let (t_1 Equation 2.3, Gray (1988)]. Namely,

begin{equation}

hat{F}_j(t|{bf X}_i)

=sum_{t_{k} expected number of life years lost and represents a measure of

mortality. This is also called the cause-(j) mortality, is

denoted by (mathcal{M}_j) and given by

begin{equation}

hat{mathcal{M}}_j = int_{0}^{tau} hat{F}_j(s) ds = sum_{t_{k+1} le

tau}hat{F}_j(t_k)[t_{k+1} – t_k] .

label{eqn:cause.specific.mortality}

end{equation}

#### Prediction error

The cause-specific error rate for each event (j) is estimated using the

concordance index protocol resulting in equation eqref{eqn:concordance}. Let

(hat{mathcal{M}}_{i,j}) denote the cause-(j) mortality for case (i). We

say that case (i) has a higher risk of event (j) than case (i^{‘}) if

(hat{mathcal{M}}_{i,j} > hat{mathcal{M}}_{i^{‘},j}). Applying

equation eqref{eqn:concordance}

results in a concordance index (C_j) and an error rate (Err_j = 1 –

C_j).

An example of a competing risk call follows:

```
# Women's Interagency HIV Study (WIHS). Competing risk
# data set involving AIDS in women.
data(wihs, package = "randomForestSRC")
wihs.obj
```

### Regression

There are three split rules that can be used to grow regression trees.

In this model, the y-outcome associated with an individual (i) is a single real

value. Denote the response for (i) by (Y_i in mathbb{R}).

#### Weighted variance splitting

Suppose the proposed split for the root node is of the

form (x le c) and (x>c) for a continuous x-variable (x), and a split

value of (c). The split rule is

begin{equation}

theta(x,c) = frac{n_l}{n} times frac{1}{n_l} sum_{x_i le c} (Y_i – bar{Y}_l)^2 + frac{n_r}{n} times

frac{1}{n_r} sum_{x_i > c} (Y_i – bar{Y}_r)^2 ,

label{eqn:regression.weighted}

end{equation}

where the subscripts (l) and (r) indicate left and right daughter

node membership respectively. Then

begin{equation*}

bar{Y}_l = frac{1}{n_l} sum_{x_i le c} Y_i , hskip10pt bar{Y}_r = frac{1}{n_r} sum_{x_i > c} Y_i

end{equation*}

are the means within each of the daughters, and (n_l) and (n_r) are

the number of cases in the left and right daughter respectively such that ((n = n_l + n_r).) The goal is to

find (x) and (c) to minimize (theta(x,c)).

#### Unweighted variance splitting

In this rule, it is necessary to minimize

begin{equation}

theta(x,c) = frac{1}{n_l} sum_{x_i le c} (Y_i – bar{Y}_l)^2 +

frac{1}{n_r} sum_{x_i > c} (Y_i – bar{Y}_r)^2 .

label{eqn:regression.unweighted}

end{equation}

#### Heavy weighted variance splitting

In this rule, it is necessary to minimize

begin{equation}

theta(x,c) = left( frac{n_l}{n} right)^2 times frac{1}{n_l}

sum_{x_i le c} (Y_i – bar{Y}_l)^2 + left( frac{n_r}{n} right)^2 times

frac{1}{n_r} sum_{x_i > c} (Y_i – bar{Y}_r)^2 .

label{eqn:regression.heavy.weighted}

end{equation}

#### Terminal node estimators

The predicted value for a terminal node is the mean value of the cases

within that node. Let individual (i) have feature ({bf X}_i). and reside in terminal

node (h). The terminal node predicted value for a node (h) is given by

begin{equation}

hat{f}_h = frac{1}{n_h} sum_{{bf X}_i in h} Y_i .

label{eqn:mean}

end{equation}

To produce the OOB predicted value for individual (i) this quantity is averaged

over

all (tt ntree) trees. Let (hat{f}_b({bf X}_i)) denote the predicted value

for tree (b in {1,ldots,tt tree}).

As before, (I_{i,b}=1) if individual (i) is an

OOB individual for (b in {1, ldots, tt ntree }). Otherwise set (I_{i,b}=0).

The OOB predicted value for individual (i) is

begin{equation*}

hat{f}_e^{*}({bf X}_i) = frac{sum_{b=1}^{tt ntree}{I_{i,b} hat{f}_b ({bf X}_i)}}{sum_{b=1}^{tt ntree} I_{i,b}}.

end{equation*}

#### Prediction error

The estimated prediction error is given by

begin{equation}

Err = frac{1}{n} sum_{i=1}^{N} (Y_i – hat{f}_e^{*}({bf X}_i))^2 .

label{eqn:mean.squared.error}

end{equation}

An example of a regression forest follows:

```
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj
```

### Classification

There are three split rules that can be used to grow classification trees.

In this model, the y-outcome associated with an individual (i) is a

single single categorical value. Let ({bf p} = (p_1, ldots, p_J)) be the class proportions

for the classes (1) through (J) respectively, for the y-outcome in the node.

#### Weighted Gini index splitting

Suppose the proposed split for the root node is of the

form (x le c) and (x>c) for a continuous x-variable (x), and a split

value of (c). The impurity of the node is defined as

begin{equation*}

phi({bf p}) = sum_{j = 1}^{J} p_j(1 – p_j) = 1 – sum_{j = 1}^{J}

p_j^2 .

end{equation*}

The Gini index for a split (c) on (x) is

begin{equation*}

theta(x,c) = frac{n_l}{n} phi({bf p}_l) + frac{n_r}{n} phi({bf

p}_r) ,

end{equation*}

where, as before, the subscripts (l) and (r) indicate left and right daughter

node membership respectively, and (n_l) and (n_r) are

the number of cases in the daughters such that ((n = n_l + n_r).) The goal is to find (x) and (c) to minimize

begin{equation}

theta(x,c) = frac{n_l}{n} left( 1 – sum_{j = 1}^{J}

left( frac{n_{j,l}}{n_l} right)^2 right) + frac{n_r}{n} left( 1 – sum_{j =

1}^{J} left( frac{n_{j,r}}{n_r} right)^2 right) ,

end{equation}

where (n_{j,l}) and (n_{j,r}) are

the number of cases of class (j) in the left and right daughter,

respectively, such that ((n_j = n_{j,l} + n_{j,r}).) This is

equivalent to maximizing

begin{equation}

theta^*(x,c) = frac{1}{n} sum_{j = 1}^{J} frac{n_{j,l}^2}{n_l} +

frac{1}{n} sum_{j = 1}^{J} frac{(n_j – n_{j,l})^2}{n – n_l},

label{eqn:classification.weighted}

end{equation}

#### Unweighted Gini index splitting

Unweighted Gini index splitting corresponds to

begin{equation*}

theta(x,c) = phi({bf p}_l) + phi({bf p}_r) .

end{equation*}

The best split is found by minimizing

begin{equation*}

theta(x,c) = left( 1 – sum_{j = 1}^{J} left( frac{n_{j,l}}{n_l} right)^2 right)

+ left( 1 – sum_{j = 1}^{J} left( frac{n_{j,r}}{n_r}

right)^2 right) .

end{equation*}

This is equivalent to maximizing

begin{equation}

theta^*(x, c) = sum_{j = 1}^{J} left( frac{n_{j,l}}{n_l} right)^2 +

sum_{j = 1}^{J} left( frac{n_j – n_{j,l}}{n – n_l} right)^2 .

label{eqn:classification.unweighted}

end{equation}

#### Heavy weighted Gini index splitting

Heavy weighted Gini index splitting corresponds to

begin{equation*}

theta(x,c) = frac{n_l^2}{n^2} phi({bf p}_l) + frac{n_r^2}{n^2}

phi({bf p}_r) .

end{equation*}

The best split is found by minimizing

begin{equation*}

theta(x,c) = frac{n_l^2}{n^2} left( 1 – sum_{j = 1}^{J} left( frac{n_{j,l}}{n_l} right)^2 right)

+ frac{n_r^2}{n^2} left( 1 – sum_{j = 1}^{J} left(

frac{n_{j,r}}{n_r} right)^2 right) .

end{equation*}

This is equivalent to maximizing

begin{equation}

theta^*(x, c) = frac{1}{n^2} sum_{j = 1}^{J} n_{j,l}^2 +

frac{1}{n^2} sum_{j = 1}^{J} (n_j – n_{j,l})^2 – frac{n_l^2}{n^2} –

frac{n_r^2}{n^2} + 2 .

label{eqn:classification.heavy.weighted}

end{equation}

Note that the above is non-negative due to the addition of the constant last term.

#### Terminal node estimators

The predicted value for a terminal node (h) are the class proportions

in the node.

Let individual (i) have feature ({bf X}_i). and reside in terminal

node (h). Let (n_j) equal the number of bootstrap cases of class (j) in

the node. Then the proportion for the (j^{th}) class is given by

begin{equation}

hat{p}_{j,h} = frac{1}{n_h} sum_{{bf X}_i in h} I {Y_i = j } .

label{eqn:class.proportions}

end{equation}

To produce the OOB predicted value for individual (i) this is averaged

over

all (tt ntree) trees. Let (hat{p}_{j,b}({bf X}_i)) denote the predicted value

for the (j^{th}) proportion for tree (b in {1,ldots,tt ntree}).

As before, (I_{i,b}=1) if individual (i) is an

OOB individual for (b in {1, ldots, tt tree}), otherwise set (I_{i,b}=0).

The OOB predicted value for the (j^{th}) proportion for individual (i) is

begin{equation*}

hat{p}_{e,j}^{*}({bf X}_i) = frac{sum_{b=1}^{tt ntree}{I_{i,b}

hat{p}_{j,b} ({bf X}_i)}}{sum_{b=1}^{tt tree} I_{i,b}} .

end{equation*}

#### Prediction error

Consider (hat{bf p}_{e}^{*} ({bf X}_i) = (hat{p}_{e,1}^{*}({bf X}_i), ldots,

hat{p}_{e,J}^{*}({bf X}_i))). Let

begin{equation*}

hat{p}_{e,max}^{*}({bf X}_i) = max_{1 le j le J}

(hat{p}_{e,j}^{*}({bf X}_i)) ,

end{equation*}

and let (N_j^{*}) equal the number of individuals with

class equal to (j) in the data set. The estimated conditional misclassification rate is given by

begin{equation}

Err_j = frac{1}{N_j^{*}} sum_{i:Y_i = j} (Y_i ne

hat{p}_{e,max}^{*}({bf X}_i)) .

label{eqn:misclassification.rate}

end{equation}

The estimated over all misclassification rate is given by

begin{equation}

Err = frac{1}{N} sum_{i=1}^{N} (Y_i ne hat{p}_{e,max}^{*}({bf

X}_i)) .

label{eqn:conditional.misclassification.rate}

end{equation}

An example of a classification forest follows:

```
# Edgar Anderson's iris data with morphologic variation of three
# related species.
iris.obj
```

### Multivariate

In the multivariate and mixed model, the y-outcomes associated with an individual (i) can be

denoted by ({bf Y}_i = ((Y_{i,1}, ldots, Y_{i,r})) where (Y_{i,j}) is real

or categorical for each (j in {1, ldots, r }).

When (Y_{i,j}) is real for all (j in {1, ldots, r }) the model is

a multivariate regression forest. In this case the split rule statistic

is a composite of (r) split rule statistics based on

[Weighted Variance

Splitting]. The outcomes in the node are normalized to ensure that a particular

(j)-specific statistic does not overwhelm the composite statistic.

When (Y_{i,j}) is categorical for all (j in {1, ldots, r }) the

model is a multivariate classification forest. In this case the split rule statistic

is a composite of (r) split rule statistics based on

[Weighted Gini Index

Splitting]. No

normalization is necessary since the (j)-specific statistics

represent an impurity measure which is invariant with respect to scale.

In the general case where (Y_{i,j}) is real or categorical for each (j

in {1, ldots, r }) the model is a multivariate mixed forest. In this case the split rule statistic

is a composite of (r) split rule statistics based on

[Weighted Variance

Splitting] and [Weighted Gini Index

Splitting]. Only those (j)-specific

statistics representing real outcomes are normalized.

#### Terminal node estimators

There are (j) sets of terminal node estimators specific to the real or categorical

cases previously described.

#### Prediction error

There is no prediction error implemented in this scenario.

### Unsupervised

Unsupervised models are appropriate in settings where there is no

y-outcome. In the unsupervised split rule, a subset of the x-variables (the size of which is specified

by the formula) is chosen as the pseudo

y-outcomes at each node, and the multivariate split rule in the [Multivariate] section is applied. Thus, each (tt mtry)

attempt results in a multivariate mixed split rule. Note importantly

that all terminal node statistics and error rates are turned off under

unsupervised splitting.

#### Terminal node estimators

There are no terminal node estimators implemented in this scenario.

#### Prediction error

There is no prediction error implemented in this scenario.

## Supplementary Code

Generalized code for simulating the spheres used in [Tree Decision Boundary] and elsewhere:

```
``` Generalized code for simulating the survival data set used in

[Benchmark of Hybrid Parallel Processing]:

```
```

The primary harness used in [Benchmark of Hybrid

Parallel Processinge]:

```
``` The replica harness used in

[Benchmark of Hybrid Parallel Processinge]:

```
```## References

```
```

Source de l'article

```
```

```
```