Notes on Decision Tree Learning #1: Decision Trees

Notes on Trees:

Decision Trees
1. One line summary
2. Entropy, Info gain, Gini Coefficient, Gini gain
3. High Variance, tendency to overfit
Random Forest
1. One line summary / Pros of RF
2. Random Forest construction
3. Error rate
4. Feature (variable) importance
Boosted Trees (discrete adaboost)
1. Combining weak classifiers to create a strong classifier
2. Algorithm
3. Outlier detection

Decision Trees

1.1 One line summary

Decision trees is a learning algorithm that uses features to split observations along branches to obtain a conclusion in the leaf nodes (nodes with no children), where a conclusion is a classification of the input sample.

1.2 Entropy, Information gain, Gini Coefficient, Gini gain

Given a sample with many features, the task is to find the feature, or combination of features, that results in an architecture that classifies the features efficiently.

We can use the information gain measure used by ID3, C4.5, C5.0 algorithms, or gini impurity to find available features that will result in the most effective split at each given node. The general idea is that we want to find features that increase the information gained per node. We want to move from a state of confusion to clarity, high entropy to low entropy, and said one final way, from a heterogeneous state to a homogeneous state.

Information Gain

Calculating the information gain algorithmically is a nice to way to introduce the concept. In order to understand information gain, the concept of entropy needs to be introduced first.

\[entropy(p_1…p_n) = -\sum_{i=1}^n p_i\log_2(p_i)\]

(e.g. considering a binary classification task, if the number of \(0\) and number of \(1\) is equal the entropy is maximized where as if there are only \(0\) or only \(1\), then entropy is minimized. We can quicky confirm this.

import numpy as np
def entropy(x):
    n = float(len(x))
    n_ones = np.count_nonzero(x)
    # Need to handle case log2(0) = -inf
    if n_ones == 0:
        return 0
    else: 
        entropy = (
            -(n_ones/n)*np.log2(n_ones/n) - (1-n_ones/n)*np.log2(1-n_ones/n)
        )    
        return entropy

x1 = np.array([0,0,0,0,0,0,0,0,0,0])
x2 = np.array([0,0,0,0,0,1,1,1,1,1])

print("Entropy of homogenous classification: {}".format(entropy(x1)))
print("Entropy of even split classification: {}".format(entropy(x2)))

Entropy of homogenous classification: 0
Entropy of even split classification: 1.0

Entropy is also intepreted as the number of bits needed to explain the classification. (e.g. when the classification is evenly split, we need 1 bit to explain 2 classes.) [1]

Calculate entropy for parent node. The higher the entropy the worse it is.
For each feature:
- calculate the entropy for children nodes after splitting on the selected feature.
- calculate the weighted sum of entropies across children nodes.
- calculate information gain as \(entropy(parent) - \sum_i entropy(child_i)\)
Select feature with highest information gain
Repeat 1-3 until all nodes are leafs (1 class), or a max depth limit is reached.

Gini Gain

The gini gain is a similar concept to information gain, in that it provides an evaluation metric to select features that create the best split. Gini is used in the CART (Classification and Regression Trees) algorithm.

Impurity, or Gini Coefficient, is computed using below equation: \[impurity(p_1..p_n) = 1- \sum_{i=1}^n p_i^2\]

def impurity(x):
    n = float(len(x))
    n_ones = np.count_nonzero(x)
    impurity = (
        1-((n_ones/n)**2 + (1-n_ones/n)**2)
    )    
    return impurity

print("Impurity of homogenous classification: {}".format(impurity(x1)))
print("Impurity of even split classification: {}".format(impurity(x2)))

Impurity of homogenous classification: 0.0
Impurity of even split classification: 0.5

We can quickly observe that the metric outputs similar results to the entropy measure. Gini gain can be computed in the same way as was done for information gain.

1.3 High variance, tendency to overfit

Decision trees are often associated with high variance and the tendency to overfit to the training data. Given a deep enough tree, and nodes, a decision tree can perfectly fit the training data, but fail on validation and testing, failing to generalize. Tree based algorithms introduced later, address the high variance attribute of decision trees.

References:

http://www.cs.bc.edu/~alvarez/ML/id3
http://web.cs.ucdavis.edu/~vemuri/classes/ecs271/Decision%20Trees-Construction.htm