A compilation of concepts I want to remember...

» Home
» Github

# A Review of XGBoost (eXtreme Gradient Boosting)

14 Dec 2017

Gradient boosting is an ensemble technique, where prediction is done by an ensemble of simple estimators. Realistically, gradient boosting can be done over various estimators but in practice GBDT is used where gradient boosting is over decision trees. Instead of heterogeneous grouping of estimators, the grouping is homogeneous and consists of a variation of decision trees with different parameter settings. The tree is built greedily, thus the algorithm is fast but at a cost. The greedy strategy often times results in sub-optimal solutions. [1]

The decision tree ensemble is a set of classification and regression trees (CART). In contrast to standard decision trees where the leaf contains decisions, the leaves found in CART are associated with real value scores which provides a richer interpretation that go beyond classification. [3] The ensemble can be mathematically represented as follows:

$\sum_{k=1}^K D_k$

The algorithm is trained by iteratively adding a decision tree trained to reduce the residual of the target function and the current ensemble of trees, $$D(x)$$. At each iteration the hope is that the residual, the remaining error, gets successively smaller. Though not realistic for a complex function, if we were able to train a tree against $$R(x)$$ where the residual was truly zero, then the trained ensemble would have fit the distribution completely and perfectly.

$R(x) = f(x) – D(x)$ $where D(x) = tree_1(x) + tree_2(x) + …$

### Objective function and Optimization

Gradient boosting relies on regression trees, where the optimization step works to reduce RMSE, while for binary classification the standard log loss, $$-\frac{1}{N}\sum_{i=1}^Ny_i log(p_i) + (1+y_i) log(1-p_i)$$ is used. For a multi-class classification problem the cross entropy loss is the input to the objective function to be optimized.

Combining the loss function with a regularization term arrives at the objective function. The regularization term controls the complexity and reduces the risk of over-fitting. [2] XGBoost uses gradient descent for optimization improving the predictive accuracy at each optimization step by following the negative of the gradient as we are trying to find the “sink” in a n-dimensional plane. In XGBoost, the regularization term is defined as:

$\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^Tw_j^{2}$

### Parameters

1. Learning rate, $$\eta$$, is a factor that is applied to each individual tree prior to adding to the ensemble. In practice, a small learning rate is preferred as large learning rates results in larger steps which increases the frequency of discontinuities. A suggested learning rate range for grid search is a range between $$0.01 < \eta < 0.1$$.

2. Number of trees, and depth of each tree. Typically a larger number of trees is preferred with a depth that is not excessive. As seen in the visualization [1], we can see that an ensemble of deep trees can result in relatively lower residuals, but at the cost of increased noise/discontinues in the surface of the function approximation.

### XGBoost Parameters (Going Deeper):

The authors of XGBoost have divided the parameters into three categories, general parameters, booster parameters, and learning task parameters. I have highlighted the majority of parameters to be considered when tuning parameters. For the full list refer to the documentation. [5]

#### General Parameters: parameters to defined the overall functionality.

1. $$\textbf{booster}$$ (default = gbtree): can select the type of model (gbtree or gblinear) to run at each iteration.

2. $$\textbf{silent}$$ (default = 0): If set to one, silent mode is set and the modeler will not receive any feedback after each iteration.

3. $$\textbf{nthread}$$ (default = max # of threads): used to set the number of cores to use for processing.

#### Booster Parameters: Two types of boosters, tree and linear, but as tree typically outperforms linear, the tree boosters is only considered in most literature.

1. $$\textbf{eta}$$ (default = 0.3): Learning rate used to shrink weights on each step. Typical final values fall in between 0.01 ~ 0.2. [4] Note this differs from the recommendation from [1] suggesting the learning rate range best set between 0.01 ~ 0.1.

2. $$\textbf{min_child_weight}$$ (default = 1): Used to control overfitting and defines the minimum sum of weights of all observations required in a child. A larger number restricts models ability to learn finer details of training set.

3. $$\textbf{max_depth}$$ (default = 6): Typical values 3-10, defines the maximum depth of a tree.

4. $$\textbf{max_leaf_nodes}$$: the number of terminal nodes of leaves in a tree. Has a mathematical relationship with depth of tree.

5. $$\textbf{gamma}$$ (default = 0): specifies the minimum loss reduction required to make a split.
6. $$\textbf{subsample}$$ (default = 1): Defines the fraction of observations to be used when sampling randomly from each tree. Typically values use range between 0.5 – 1.
7. $$\textbf{colsample_bytree}$$ (default = 1): Fraction of columns to be used when random sampling for tree build out.
8. $$\textbf{lambda}$$ (default = 1): L2 regularization term.
9. $$\textbf{alpha}$$ (default = 0): L1 regularization term.
10. $$\textbf{scale_pos_weight}$$ (default = 1): A value greater than 0 should be used in case of high class imbalances.

#### Learning Task Parameters: used to define the optimization objective.

1. $$\textbf{objective}$$ (default = reg:linear): defines the loss function to be minimized. Options include binary:logistic, multi:softmax, multi:softprob.
2. $$\textbf{eval_metric}$$: Metric used for validation. Options include rmse(default for regression), mae, logloss, error (default for classification), merror, mlogloss, auc.