#### Notes on Trees:

- Decision Trees
  - One line summary
  - Entropy, Info gain, Gini Coefficient, Gini gain
  - High Variance, tendency to overfit

- Random Forest
  - One line summary / Pros of RF
  - Random Forest construction
  - Error rate
  - Feature (variable) importance

- Boosted Trees (discrete AdaBoost)
  - Combining weak classifiers to create a strong classifier
  - Algorithm
  - Outlier detection

### Random Forest

#### 2.1 One line summary

As the name suggests, a Random Forest is a collection of decision trees (the forest), with each tree constructed using a randomized procedure.

Ensembling the randomly generated tree classifiers results in a reduction of variance and an improvement in generalization relative to a single decision tree.

**Pros of Random Forest** [1]:

- It is unexcelled in accuracy among current algorithms.
- It runs efficiently on large data bases.
- It can handle thousands of input variables without variable deletion.
- It gives estimates of what variables are important in the classification.
- It generates an internal unbiased estimate of the generalization error as the forest building progresses.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
- It has methods for balancing error in class population unbalanced data sets.
- Generated forests can be saved for future use on other data.
- Prototypes are computed that give information about the relation between the variables and the classification.
- It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
- The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
- It offers an experimental method for detecting variable interactions.

#### 2.2 Random Forest construction

Randomness is present at a number of points in the construction of a forest: the sample set, the feature set, and the depth of each tree can all be determined by a stochastic process. If the splits themselves are also chosen at random, rather than by maximizing Gini gain or information gain, the result is the **ExtraTrees** learner, or extremely randomized trees.
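For a concrete comparison, here is a minimal sketch using scikit-learn's `RandomForestClassifier` and `ExtraTreesClassifier`; the toy dataset and parameter values are illustrative choices, not taken from these notes:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Toy dataset purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Random Forest: best split chosen by impurity over a random feature subset.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# ExtraTrees: split thresholds are drawn at random, adding further randomness.
et = ExtraTreesClassifier(n_estimators=100, random_state=0)

print("RF accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("ET accuracy:", cross_val_score(et, X, y, cv=5).mean())
```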

When constructing the ensemble, **bootstrap aggregating**, also known as bagging, is applied to the tree learners.
Given a training set \(X = x_1, \dots, x_n\) with responses \(Y = y_1, \dots, y_n\),
bagging repeatedly (B times) selects a random sample with replacement of the
training set and fits a tree to each such sample. [2]
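The bootstrap sampling step itself is only a few lines of numpy. In this sketch, `fit_tree` is a hypothetical helper standing in for whatever tree learner is being bagged, and `X`, `Y` are assumed to be numpy arrays:

```
import numpy as np

def bootstrap_sample(X, Y, rng):
    """Draw n training pairs with replacement (one bootstrap replicate)."""
    n = len(X)
    idx = rng.integers(0, n, size=n)  # indices sampled with replacement
    return X[idx], Y[idx]

rng = np.random.default_rng(0)
B = 100  # number of bagged trees
# fit_tree is a hypothetical helper; each tree sees its own bootstrap replicate.
# trees = [fit_tree(*bootstrap_sample(X, Y, rng)) for _ in range(B)]
```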

For the classification task, a majority vote is taken over the set of tree learners.

Using numpy, we can implement the majority vote function like so:

```
# Class method
def majority_vote(self, samples):
    """Compute the majority vote over the forest's tree classifiers.

    Args:
        samples: a set of samples to classify.
    Returns:
        final_predictions: per-sample predictions based on majority vote.
    """
    n_samples = len(samples)
    # Get the classification of every sample from each tree.
    predictions = np.array([tree.classify(samples) for tree in self.trees])
    # For each sample, pick the most frequent class across the trees.
    final_predictions = [
        np.argmax(np.bincount(predictions.T[sample]))
        for sample in range(n_samples)
    ]
    return final_predictions
```
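As a quick sanity check on the voting logic, the same `bincount`/`argmax` idea can be run on a toy prediction matrix (class labels are assumed to be non-negative integers, which `np.bincount` requires):

```
import numpy as np

# 3 trees x 4 samples of predicted class labels.
predictions = np.array([[0, 1, 1, 2],
                        [0, 1, 2, 2],
                        [1, 1, 2, 0]])

final = [np.argmax(np.bincount(predictions.T[s]))
         for s in range(predictions.shape[1])]
print(final)  # [0, 1, 2, 2]
```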

#### 2.3 Error rate

The error rate of the forest depends on how accurate the individual classifiers are and on the dependence between them. [3] Put another way, the strength of each classifier and the correlation between classifiers together determine the error rate.
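In practice, an out-of-bag (OOB) estimate of this error can be read off the fitted forest in scikit-learn. A minimal sketch, with an illustrative toy dataset:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# oob_score=True tracks out-of-bag accuracy while the forest is built.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print("OOB error estimate:", 1 - clf.oob_score_)
```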

Breiman introduced an upper bound on the generalization error in terms of the strength \(s\) of the individual classifiers and the mean correlation \(\bar{\rho}\) between them: \[\text{generalization error} \leq \frac{\bar{\rho}\,(1 - s^2)}{s^2}\]

An in-depth analysis can be found in **A Study of Strength and Correlation in
Random Forests**. [4]
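For intuition, we can plug illustrative numbers into the bound (the values of \(s\) and \(\bar{\rho}\) below are made up, not taken from [3] or [4]):

```
# Hypothetical strength s and mean tree correlation rho_bar.
s, rho_bar = 0.6, 0.3
bound = rho_bar * (1 - s ** 2) / s ** 2
print(bound)  # ~0.533: weaker or more correlated trees loosen the bound
```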

#### 2.4 Feature (variable) importance

Random forests can be used to estimate the importance of features. sklearn offers an easy way to obtain feature importances; see [5] for a practical example. Breiman's original paper [3] provides a more detailed analysis of the topic.

```
import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
# Impurity-based feature importances, computed during fitting.
importances = clf.feature_importances_
# Feature indices sorted from least to most important.
sorted_idx = np.argsort(importances)
```
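A short follow-up on the snippet above, assuming a hypothetical `feature_names` list that matches the columns of `X_train`, to print importances from most to least important:

```
# feature_names is a hypothetical list of column names for X_train.
for idx in sorted_idx[::-1]:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```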

### References:

1. https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
2. https://en.wikipedia.org/wiki/Random_forest
3. https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
4. https://hal.archives-ouvertes.fr/hal-00598466/document