
FIE453 – Big Data with Applications to Finance
Trees, Bagging, Boosting

Who?  Walt Pohl

From? Norwegian School of Economics
      Department of Finance

When? September 26, 2022


Decision Trees
Decision trees are perhaps the simplest supervised learning technique.
Pick a variable. Split the data into two groups along that variable.
Within each group, pick another variable and split the data into two subgroups.
Etc.
Once you have a bunch of groups, use the average over each group as your prediction.
You can picture this as a tree.
Example: Titanic Survival

[Figure: a fitted decision tree for Titanic survival. sibsp is the number of siblings or spouses also on the boat.]
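
A minimal sketch of fitting such a tree in R with the rpart package, assuming the ptitanic data set that ships with the rpart.plot package (the variable list is illustrative):

    library(rpart)
    library(rpart.plot)   # provides the ptitanic data and a tree plotter

    data(ptitanic)        # columns include survived, pclass, sex, age, sibsp

    fit <- rpart(survived ~ pclass + sex + age + sibsp,
                 data = ptitanic, method = "class")

    rpart.plot(fit)                       # draw the fitted tree
    head(predict(fit, type = "class"))    # per-passenger predicted class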
Loss and Regularization

You choose the splits to minimize the loss function.
To regularize, you penalize based on the size of the tree. (Very important – decision trees are very prone to overfitting.)
Finding the best tree exactly is computationally hard, so algorithms use greedy heuristics.
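
In rpart, for example, the size penalty is the complexity parameter cp. A minimal sketch, continuing the ptitanic example above (the cp values are illustrative choices):

    library(rpart)

    # Grow a deliberately large tree by making the size penalty tiny.
    big <- rpart(survived ~ pclass + sex + age + sibsp, data = ptitanic,
                 method = "class", control = rpart.control(cp = 0.001))

    printcp(big)                     # cross-validated error for each subtree size
    pruned <- prune(big, cp = 0.01)  # keep only splits that improve fit by at least cp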
Pros and Cons of Decision Trees
Pros:
Can in principle fit an arbitrary nonlinear relationship.
Output is very easy to interpret.
Cons:
An arbitrary nonlinear relationship can require an arbitrarily large tree.
Very sensitive to outliers. One observation can change an early split, which makes all of the later splits completely different.
Ensemble Methods
Ensemble methods combine the results of multiple supervised learning techniques into a single model.
Weaknesses of the individual methods can cancel out when they are combined.
How the individual techniques are combined depends on the type of problem. For example:
Regression – average the predictions together.
Classification – take a majority vote.
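
A minimal sketch of both combination rules on toy predictions (the three "learners" here are just random placeholders):

    set.seed(1)

    # Toy predictions: three learners, five observations (one column per learner).
    preds_reg <- cbind(l1 = rnorm(5), l2 = rnorm(5), l3 = rnorm(5))
    preds_cls <- matrix(sample(c("yes", "no"), 15, replace = TRUE), nrow = 5)

    # Regression: average the learners' predictions for each observation.
    rowMeans(preds_reg)

    # Classification: take a majority vote across learners for each observation.
    apply(preds_cls, 1, function(row) names(which.max(table(row))))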
Types of Ensemble Methods

We consider two main types of ensemble methods:
Bagging
Boosting
Decision trees are frequently used as the raw methods to combine. We refer to each individual method as a “learner”.
Bagging

Bootstrap aggregation (bagging) is a bootstrap-based method. It is a very simple idea:
Start with N different learners.
For each learner, draw a random sample from the data with replacement (a bootstrap sample), and apply the learner to it.
Combine the results.
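
A minimal bagging sketch with rpart trees as the learners, again assuming the ptitanic data from rpart.plot; the number of learners is an illustrative choice:

    library(rpart)
    library(rpart.plot)   # for the ptitanic data
    data(ptitanic)

    set.seed(42)
    n_learners <- 50
    n <- nrow(ptitanic)

    # Fit each learner on its own bootstrap sample (drawn with replacement).
    fits <- lapply(1:n_learners, function(b) {
      idx <- sample(n, n, replace = TRUE)
      rpart(survived ~ pclass + sex + age + sibsp,
            data = ptitanic[idx, ], method = "class")
    })

    # Combine the results by majority vote over the individual trees.
    votes  <- sapply(fits, function(f) as.character(predict(f, ptitanic, type = "class")))
    bagged <- apply(votes, 1, function(row) names(which.max(table(row))))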
Random Forests
A refined version of this idea applied to decision trees is called a random forest.
The model consists of a collection of trees. Each individual tree is fitted as follows:
At each split point, randomly throw out some of the variables and some of the data.
Choose the split using only that subset of the data.
At each new split point, rerandomize.
Any mistakes you make from randomizing get washed out when you average over the trees.
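
A minimal sketch with the randomForest package; mtry is the number of variables considered at each split, and the tuning values below are illustrative (the ptitanic data is again an assumption):

    library(randomForest)
    library(rpart.plot)    # for the ptitanic data
    data(ptitanic)

    # randomForest does not accept missing values, so drop incomplete rows here.
    dat <- na.omit(ptitanic[, c("survived", "pclass", "sex", "age", "sibsp")])

    set.seed(1)
    rf <- randomForest(survived ~ ., data = dat,
                       ntree = 500,   # number of trees to average
                       mtry  = 2)     # variables tried at each split
    rf                                # prints the out-of-bag error estimate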
Regularizing Random Forests

Random forests are mostly self-regularizing.
The fact that you are always working with a subset of the data helps prevent overfitting.
Averaging the models cancels out accidental overfits, even if you use big trees.
Boosting
Boosting results from a famous theorem in computer science. A weak learner is a learner that is only slightly better than random guessing. Is there an algorithm that turns a weak learner into a strong learner?
Yes. Any technique for doing so is known as boosting.
Bagging reduces variance (by averaging). Boosting reduces bias.
Boosting, cont’d

How does boosting accomplish this trick?
Start with a learner.
Look at the mistakes that learner makes.
Make a new learner that diagnoses the mistakes of the first learner.
Look at the mistakes that the first two learners make together.
Repeat.
AdaBoost
AdaBoost is a simple classification algorithm that turns weak learners into strong learners by iteratively reweighting the sample.
Each subsequent learner therefore pays more attention to the cases that were hard for the previous learners to classify.
The weak learners can even be “decision stumps” – decision trees with only one split.
AdaBoost Algorithm
Each observation has a weight, as does each learner. The observation weights start out equal.
The algorithm works as follows:
1 Fit the m-th learner to the weighted sample.
2 Compute how many mistakes it makes. Observations with more weight count for more.
3 Choose the weight of the learner so that a learner which makes few mistakes gets a high weight.
4 Update the observation weights so that the observations it has trouble with count for more.
The final predictor is the weighted sum of the individual learners (see the sketch below).
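
A compact sketch of this loop (essentially AdaBoost.M1) using rpart stumps on simulated data; the labels are coded as -1/+1 and the data and number of rounds are illustrative assumptions, not part of the slides:

    library(rpart)

    set.seed(1)
    n  <- 200
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- ifelse(x1 + x2 + rnorm(n, sd = 0.3) > 0, 1, -1)   # labels coded as -1 / +1
    dat <- data.frame(x1, x2, y = factor(y))

    M      <- 25                  # number of boosting rounds (illustrative)
    w      <- rep(1 / n, n)       # observation weights start out equal
    stumps <- vector("list", M)
    alpha  <- numeric(M)          # learner weights

    for (m in 1:M) {
      # 1. Fit a decision stump to the weighted sample.
      stumps[[m]] <- rpart(y ~ x1 + x2, data = dat, weights = w,
                           control = rpart.control(maxdepth = 1))
      pred <- as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))

      # 2. Weighted error: mistakes on heavily weighted observations count for more.
      err <- sum(w * (pred != y)) / sum(w)

      # 3. Learner weight: a learner with few mistakes gets a high weight.
      alpha[m] <- 0.5 * log((1 - err) / err)

      # 4. Reweight: observations this learner got wrong count for more next round.
      w <- w * exp(-alpha[m] * y * pred)
      w <- w / sum(w)
    }

    # Final predictor: sign of the alpha-weighted sum of the stump predictions.
    F_hat <- rowSums(sapply(1:M, function(m)
      alpha[m] * as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))))
    mean(sign(F_hat) == y)        # training accuracy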
Why Does AdaBoost Work?

The steps in the algorithm do two things:
Data points that are hard to classify are given more weight.
Learners that do well count for more.
Gradient Boosting

Gradient boosting (or the gradient boosting machine) is the leading general-purpose machine learning algorithm.
For example, most winning entries on Kaggle – a machine learning competition site – use gradient boosting.
AdaBoost is restricted to classification. Gradient boosting can combine arbitrary learners, for arbitrary loss functions.
Gradient Boosting

The idea is simple:
Use the first learner to predict the data.
Compute the “residuals” from the first learner, and use the second learner to predict the residuals.
Etc.
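
A minimal sketch of this residual-fitting loop for a regression problem, with small rpart trees as the learners; the simulated data, learning rate, and number of rounds are illustrative assumptions:

    library(rpart)

    set.seed(2)
    n <- 300
    x <- runif(n, -3, 3)
    y <- sin(x) + rnorm(n, sd = 0.3)
    dat <- data.frame(x, y)

    M     <- 100                  # number of boosting rounds (illustrative)
    nu    <- 0.1                  # learning rate / shrinkage (illustrative)
    F_hat <- rep(mean(y), n)      # start from the overall mean
    trees <- vector("list", M)

    for (m in 1:M) {
      resid_m <- y - F_hat        # the "residuals": what is left to explain
      trees[[m]] <- rpart(r ~ x, data = data.frame(x = x, r = resid_m),
                          control = rpart.control(maxdepth = 2))
      # Nudge the current fit toward the new tree's predictions.
      F_hat <- F_hat + nu * predict(trees[[m]], dat)
    }

    mean((y - F_hat)^2)           # training mean squared error after boosting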
Output of Gradient Boosting

You fit a sequence of (small) decision trees:
The first tree fits the data.
The second, third, etc. tree fits the error left by the previous trees.
The final model combines all of the trees, as in AdaBoost.
Putting the “Gradient” in Gradient Boosting

The challenge is to define a “residual” for each possible loss function.
Gradient boosting uses the gradient of the loss function on the data – this points in the direction of best improvement.
(For squared-error regression, this is exactly the residuals returned by R.)
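
For example, with the (conventional) one-half squared-error loss, the negative gradient with respect to the current prediction f is just the ordinary regression residual:

    L(y, f) = \tfrac{1}{2}(y - f)^2, \qquad -\frac{\partial L}{\partial f} = y - f.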
Gradient Boosting Algorithm

Gradient boosting can be viewed as a form of gradient descent.
You have a measure of predictive accuracy, called the loss.
At each step you compute the gradient of the loss and fit a tree to that gradient.
You then take a step in the direction of the tree.
The final model is a sum of the tree predictions.
Regularization of Boosting
Boosting can overfit.
Each individual tree tends to be small (gradient-boosted trees use stumps by default).
At the same time, each new tree tries to fit more and more hard-to-match aspects of the data.
The main way to regularize is to cut back on the number of trees.
This makes regularization easy to implement – just look at what happens when you throw out the later trees.
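
A minimal sketch with the gbm package, using cross-validation to choose how many trees to keep; the simulated data and all tuning values are illustrative assumptions:

    library(gbm)

    set.seed(3)
    n <- 500
    x <- runif(n, -3, 3)
    dat <- data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))

    fit <- gbm(y ~ x, data = dat,
               distribution = "gaussian",  # squared-error loss
               n.trees = 2000,             # fit more trees than we expect to keep
               interaction.depth = 1,      # stumps
               shrinkage = 0.01,           # learning rate
               cv.folds = 5)

    best <- gbm.perf(fit, method = "cv")   # number of trees chosen by cross-validation
    pred <- predict(fit, dat, n.trees = best)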
Boosting Implementations

There are several gradient boosting implementations, which are worth comparing:
gbm – the original R implementation.
xgboost – the most high-profile implementation.
lightgbm – by Microsoft.
catboost – specialized handling of categorical features. Originally by Yandex.
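
As one example of these interfaces, a quick xgboost sketch on the same kind of simulated data; xgboost wants a numeric matrix rather than a formula, and the tuning values are illustrative assumptions:

    library(xgboost)

    set.seed(4)
    n <- 500
    x <- matrix(runif(n, -3, 3), ncol = 1, dimnames = list(NULL, "x"))
    y <- sin(x[, 1]) + rnorm(n, sd = 0.3)

    dtrain <- xgb.DMatrix(data = x, label = y)
    fit <- xgb.train(params = list(max_depth = 2,      # small trees
                                   eta = 0.1,          # learning rate
                                   objective = "reg:squarederror"),
                     data = dtrain, nrounds = 200)

    pred <- predict(fit, x)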
Pros and Cons of Bagging and Boosting

Pros: They fit complex data much better.

Cons: While we can understand the individual trees, the whole model is an opaque black box.
