#### Notes on Trees:

- Decision Trees
  - One line summary
  - Entropy, Info gain, Gini Coefficient, Gini gain
  - High Variance, tendency to overfit

- Random Forest
  - One line summary / Pros of RF
  - Random Forest construction
  - Error rate
  - Feature (variable) importance

- Boosted Trees (discrete AdaBoost)
  - Combining weak classifiers to create a strong classifier
  - Algorithm
  - Outlier detection

### Random Forest

#### 2.1 One line summary

As the name suggests, a Random Forest is a collection of decision trees (the forest), with each tree constructed using a randomized procedure.

Ensembling the randomly generated tree classifiers results in a reduction of variance and an improvement in generalization relative to a single decision tree.

**Pros of Random Forest** [1]:

- It is unexcelled in accuracy among current algorithms.
- It runs efficiently on large data bases.
- It can handle thousands of input variables without variable deletion.
- It gives estimates of what variables are important in the classification.
- It generates an internal unbiased estimate of the generalization error as the forest building progresses.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
- It has methods for balancing error in class population unbalanced data sets.
- Generated forests can be saved for future use on other data.
- Prototypes are computed that give information about the relation between the variables and the classification.
- It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
- The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
- It offers an experimental method for detecting variable interactions.

#### 2.2 Random Forest construction

Randomness is present at a number of points in the construction of a forest: the sample set, the feature set, and the depth of each tree can all be determined by a stochastic process. If the splits themselves are also chosen at random, rather than by maximizing Gini gain or information gain, the result is the **ExtraTrees** learner, or extremely randomized trees.
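For a concrete comparison, here is a minimal sketch using scikit-learn's `RandomForestClassifier` and `ExtraTreesClassifier`; the toy dataset and parameter values are illustrative choices, not taken from these notes:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Toy dataset purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Random Forest: best split chosen by impurity over a random feature subset.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# ExtraTrees: split thresholds are drawn at random, adding further randomness.
et = ExtraTreesClassifier(n_estimators=100, random_state=0)

print("RF accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("ET accuracy:", cross_val_score(et, X, y, cv=5).mean())
```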

When constructing the ensemble, **bootstrap aggregating**, also known as bagging, is applied to the tree learners.
Given a training set \(X = x_1, \dots, x_n\) with responses \(Y = y_1, \dots, y_n\),
bagging repeatedly (B times) selects a random sample with replacement of the
training set and fits a tree to each such sample. [2]
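The bootstrap sampling step itself is only a few lines of numpy. In this sketch, `fit_tree` is a hypothetical helper standing in for whatever tree learner is being bagged, and `X`, `Y` are assumed to be numpy arrays:

```
import numpy as np

def bootstrap_sample(X, Y, rng):
    """Draw n training pairs with replacement (one bootstrap replicate)."""
    n = len(X)
    idx = rng.integers(0, n, size=n)  # indices sampled with replacement
    return X[idx], Y[idx]

rng = np.random.default_rng(0)
B = 100  # number of bagged trees
# fit_tree is a hypothetical helper; each tree sees its own bootstrap replicate.
# trees = [fit_tree(*bootstrap_sample(X, Y, rng)) for _ in range(B)]
```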

For the classification task, a majority vote is taken over the set of tree learners.

Using numpy, we can implement the majority vote function like so:

```
# Class method
def majority_vote(self, samples):
    """Compute the majority vote over the forest's tree classifiers.

    Args:
        samples: a set of samples to classify.
    Returns:
        final_predictions: per-sample predictions based on majority vote.
    """
    n_samples = len(samples)
    # Get the classification of every sample from each tree.
    predictions = np.array([tree.classify(samples) for tree in self.trees])
    # For each sample, pick the most frequent class across the trees.
    final_predictions = [
        np.argmax(np.bincount(predictions.T[sample]))
        for sample in range(n_samples)
    ]
    return final_predictions
```
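As a quick sanity check on the voting logic, the same `bincount`/`argmax` idea can be run on a toy prediction matrix (class labels are assumed to be non-negative integers, which `np.bincount` requires):

```
import numpy as np

# 3 trees x 4 samples of predicted class labels.
predictions = np.array([[0, 1, 1, 2],
                        [0, 1, 2, 2],
                        [1, 1, 2, 0]])

final = [np.argmax(np.bincount(predictions.T[s]))
         for s in range(predictions.shape[1])]
print(final)  # [0, 1, 2, 2]
```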

#### 2.3 Error rate

The error rate of the forest depends on how accurate the individual classifiers are and on the dependence between them. [3] Put another way, the strength of each classifier and the correlation between classifiers together determine the error rate.
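In practice, an out-of-bag (OOB) estimate of this error can be read off the fitted forest in scikit-learn. A minimal sketch, with an illustrative toy dataset:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# oob_score=True tracks out-of-bag accuracy while the forest is built.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print("OOB error estimate:", 1 - clf.oob_score_)
```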

Breiman introduced an upper bound on the generalization error in terms of the strength \(s\) of the individual classifiers and the mean correlation \(\bar{\rho}\) between them: \[\text{generalization error} \leq \frac{\bar{\rho}\,(1 - s^2)}{s^2}\]

An in-depth analysis can be found in **A Study of Strength and Correlation in
Random Forests**. [4]
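For intuition, we can plug illustrative numbers into the bound (the values of \(s\) and \(\bar{\rho}\) below are made up, not taken from [3] or [4]):

```
# Hypothetical strength s and mean tree correlation rho_bar.
s, rho_bar = 0.6, 0.3
bound = rho_bar * (1 - s ** 2) / s ** 2
print(bound)  # ~0.533: weaker or more correlated trees loosen the bound
```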

#### 2.4 Feature (variable) importance

Random forests can be used to estimate the importance of features. sklearn offers an easy way to obtain feature importances; see [5] for a practical example. Breiman's original paper [3] provides a more detailed analysis of the topic.

```
import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
# Impurity-based feature importances, computed during fitting.
importances = clf.feature_importances_
# Feature indices sorted from least to most important.
sorted_idx = np.argsort(importances)
```
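A short follow-up on the snippet above, assuming a hypothetical `feature_names` list that matches the columns of `X_train`, to print importances from most to least important:

```
# feature_names is a hypothetical list of column names for X_train.
for idx in sorted_idx[::-1]:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```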

### References:

1. https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
2. https://en.wikipedia.org/wiki/Random_forest
3. https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
4. https://hal.archives-ouvertes.fr/hal-00598466/document