Notes on Trees:
- Decision Trees
  - One line summary
  - Entropy, Info gain, Gini Coefficient, Gini gain
  - High Variance, tendency to overfit
- Random Forest
  - One line summary / Pros of RF
  - Random Forest construction
  - Error rate
  - Feature (variable) importance
- Boosted Trees (discrete AdaBoost)
  - Combining weak classifiers to create a strong classifier
  - Algorithm
  - Outlier detection
Decision Trees
1.1 One line summary
A decision tree is a learning algorithm that uses features to split observations along branches and reach a conclusion in the leaf nodes (nodes with no children), where a conclusion is a classification of the input sample.
1.2 Entropy, Information gain, Gini Coefficient, Gini gain
Given a sample with many features, the task is to find the feature, or combination of features, that results in a tree that classifies the observations efficiently.
We can use the information gain measure used by the ID3, C4.5, and C5.0 algorithms, or the Gini impurity, to find the available feature that will result in the most effective split at each given node. The general idea is that we want to find features that maximize the information gained per node. We want to move from a state of confusion to clarity, from high entropy to low entropy, or, put one final way, from a heterogeneous state to a homogeneous state.
Information Gain
Calculating the information gain algorithmically is a nice way to introduce the concept. In order to understand information gain, the concept of entropy needs to be introduced first.
\[entropy(p_1, \dots, p_n) = -\sum_{i=1}^n p_i\log_2(p_i)\]
For example, considering a binary classification task: if the number of \(0\)s and the number of \(1\)s are equal, the entropy is maximized, whereas if there are only \(0\)s or only \(1\)s, the entropy is minimized. We can quickly confirm this.
import numpy as np

def entropy(x):
    n = float(len(x))
    n_ones = np.count_nonzero(x)
    # Need to handle the cases where log2(0) = -inf (all zeros or all ones)
    if n_ones == 0 or n_ones == n:
        return 0
    entropy = (
        -(n_ones/n)*np.log2(n_ones/n) - (1-n_ones/n)*np.log2(1-n_ones/n)
    )
    return entropy
x1 = np.array([0,0,0,0,0,0,0,0,0,0])
x2 = np.array([0,0,0,0,0,1,1,1,1,1])
print("Entropy of homogenous classification: {}".format(entropy(x1)))
print("Entropy of even split classification: {}".format(entropy(x2)))
Entropy of homogenous classification: 0
Entropy of even split classification: 1.0
Entropy is also interpreted as the number of bits needed to encode the classification (e.g. when the classification is evenly split, we need 1 bit to explain the 2 classes). [1]
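To see the same interpretation on an uneven split, the snippet below (an added illustration; the x3 array is made up, and it reuses the entropy function defined above) shows that a 9-to-1 split needs less than 1 bit to describe:
# An uneven 9:1 split is easier to describe than an even split,
# so its entropy falls between 0 and 1 bit (roughly 0.469).
x3 = np.array([0,0,0,0,0,0,0,0,0,1])
print("Entropy of 9:1 split classification: {}".format(entropy(x3)))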
Using entropy, the procedure for selecting splits is:
1. Calculate the entropy of the parent node; the higher the entropy, the more mixed (impure) the node is.
2. For each candidate feature:
  - calculate the entropy of the children nodes after splitting on the selected feature.
  - calculate the weighted sum of entropies across the children nodes, weighting each child by its fraction of the parent's samples.
  - calculate the information gain as \(entropy(parent) - \sum_i w_i \cdot entropy(child_i)\), where \(w_i\) is that fraction (a small sketch of this computation follows the list).
3. Select the feature with the highest information gain.
4. Repeat 1-3 until all nodes are leaves (1 class), or a maximum depth limit is reached.
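As a small sketch of steps 2 and 3 (the information_gain helper and the perfect split below are made-up illustrations, reusing the entropy function from above):
def information_gain(parent, children):
    # children is a list of label arrays produced by splitting the parent
    n_parent = float(len(parent))
    weighted_child_entropy = sum(
        (len(child) / n_parent) * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy

# Hypothetical perfect split of the even 5/5 parent into two pure children.
parent = np.array([0,0,0,0,0,1,1,1,1,1])
children = [np.array([0,0,0,0,0]), np.array([1,1,1,1,1])]
print("Information gain of a perfect split: {}".format(information_gain(parent, children)))
# A perfect split recovers the full 1 bit of parent entropy, so the gain is 1.0.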
Gini Gain
The Gini gain is a concept similar to information gain, in that it provides an evaluation metric for selecting the feature that creates the best split. Gini impurity is used in the CART (Classification and Regression Trees) algorithm.
Impurity, or the Gini coefficient, is computed using the equation below: \[impurity(p_1, \dots, p_n) = 1 - \sum_{i=1}^n p_i^2\]
def impurity(x):
    n = float(len(x))
    n_ones = np.count_nonzero(x)
    impurity = (
        1 - ((n_ones/n)**2 + (1-n_ones/n)**2)
    )
    return impurity
print("Impurity of homogenous classification: {}".format(impurity(x1)))
print("Impurity of even split classification: {}".format(impurity(x2)))
Impurity of homogenous classification: 0.0
Impurity of even split classification: 0.5
We can quickly observe that the metric behaves similarly to the entropy measure. Gini gain can be computed in the same way as was done for information gain, as sketched below.
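A minimal sketch, assuming the impurity function above and the same hypothetical perfect split used in the information gain example:
def gini_gain(parent, children):
    # Analogous to information gain, but using Gini impurity instead of entropy.
    n_parent = float(len(parent))
    weighted_child_impurity = sum(
        (len(child) / n_parent) * impurity(child) for child in children
    )
    return impurity(parent) - weighted_child_impurity

print("Gini gain of a perfect split: {}".format(gini_gain(parent, children)))
# A perfect split removes all impurity, so the gain equals impurity(parent) = 0.5.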
1.3 High variance, tendency to overfit
Decision trees are often associated with high variance and a tendency to overfit the training data. Given enough depth, a decision tree can fit the training data perfectly, yet still fail on validation and test data, i.e. fail to generalize. The tree-based algorithms introduced later address the high-variance nature of decision trees.
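A minimal sketch of this behaviour, using scikit-learn (which these notes do not otherwise use) on a made-up noisy dataset: a fully grown tree typically reaches near-perfect training accuracy while its test accuracy lags well behind its training score.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (flip_y adds label noise).
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [3, None]:  # shallow tree vs. fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print("max_depth={}: train acc={:.3f}, test acc={:.3f}".format(
        depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))
# The fully grown tree typically scores ~1.0 on the training set but noticeably
# lower on the test set, illustrating the overfitting described above.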
References:
- http://www.cs.bc.edu/~alvarez/ML/id3
- http://web.cs.ucdavis.edu/~vemuri/classes/ecs271/Decision%20Trees-Construction.htm