# Notes on Trees:

- Decision Trees
- One line summary
- Entropy, Info gain, Gini Coefficient, Gini gain
- High Variance, tendency to overfit

- Random Forest
- One line summary / Pros of RF
- Random Forest construction
- Error rate
- Feature (variable) importance

- Boosted Trees (discrete adaboost)
- Combining weak classifiers to create a strong classifier
- Algorithm
- Outlier detection

### Boosting

#### 3.1 One line summary

A general algorithm that has the ability to improve the accuracy of a learning algorithm. The following notes will discuss boosting in the context of tree based learners. Boosting allows for the ensemble of weak classifiers to produce a strong classifier, where weak and strong has references to the accuracy of the classifier.

#### 3.2 Algorithm

The general idea is to generate an ensemble of trees sequentially, where each iteration of tree creation focuses on samples that were misclassified in the tree generation. This is done, by increasing the weight of the misclassified sample, increasing the probability of being sampled in the next tree generation round.

As a result of the sequential nature, parallelization is difficult in contrast to Random Forest, which is often considered easily parallized. Further, a benefit of boosting, is that the algorithm has the ability to not only reduce variance, but also bias, where as improvements from Random Forest is concentrated in reducing variance.

** Indeed, it has been proved that AdaBoost’s combined classifier has an error
rate
that converges to the Bayes optimal provided that the algorithm is given enough
data, that it is run for enough but not too many rounds, and that the weak
hypothe-
ses come from a class of functions that is “sufficiently rich.** [2]

The algorithm as introduced in *** Intro to Boosting *** [1]:

Given: \((x_1, y_1) … (x_m, y_m)\) where \(x_i \in X, y_i \in Y = {-1, +1}\)

Initialize: \(D_1(i) = \frac {1}{m}\)

For \(t=1..T:\)

- Train weak learner using distribution \(D_t\)
- Get weak hypothesis \(h_t: X \rightarrow {-1,+1}\) with error: \(\epsilon_t = P_{i~D_t}[h_t(x_i) \neq y_i]\)
- Choose \(\alpha_t = 0.5 \ln(\frac{1-\epsilon_t}{\epsilon_t})\)
- Update: \(D_{t+1}(i) = \frac{D_t(i)exp(-\alpha_ty_ih_t(x_i))}{Z_t}\)

Finally once \(T\) trees produced: \(H(x) = sign(\sum_{t=1}^T\alpha_th_t(x))\)

Note that the influence of the weak classifier \(h_t(x)\) is given by \(\alpha\) which is a function of the error or frequency of misclassification by the given tree. The lower the \(\epsilon\) the higher the influence.

#### 3.3 Outlier detection

As the algorithm iteratively assigns higher weights to misclassified samples,
the algorithm has the ability to detect outliers. ** Adaboost focuses its weight
on the hardest examples, the examples with highest weights often turn to be
outliers.**[1]

### References:

- https://cseweb.ucsd.edu/~yfreund/papers/IntroToBoosting.pdf
- http://rob.schapire.net/papers/explaining-adaboost.pdf