Handout9 Trees Bagging Boosting
This handout provides an overview of decision trees and the ensemble methods bagging and boosting, with applications in machine learning. It covers decision trees, random forests, AdaBoost, and gradient boosting. A key advantage of ensemble methods is that they can fit complex data while the weaknesses of the individual models cancel out; decision trees are prone to overfitting, and bagging and boosting help address this.
FIE453 – Big Data with Applications to Finance
Trees, Bagging, Boosting
Who? Walt Pohl
From? Norwegian School of Economics, Department of Finance
When? September 26, 2022
Decision Trees

Decision trees are perhaps the easiest supervised learning technique. Pick a variable and split the data into two groups along that variable. Within each group, pick another variable and split the data into two subgroups, and so on. Once you have a bunch of groups, use the average over each group as your prediction. You can picture this as a tree.

Example: Titanic Survival

The example fits a tree to Titanic passenger data to predict who survived. (sibsp, one of the predictors, is the number of siblings or spouses also on the boat.)
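A minimal sketch of such a tree in R with the rpart package follows; the data frame titanic and the column names used here (survived, pclass, sex, age, sibsp) are assumptions for illustration, not part of the handout.

    # A minimal sketch of a classification tree with rpart. The data frame
    # `titanic` and its columns (survived, pclass, sex, age, sibsp) are assumed.
    library(rpart)

    fit <- rpart(survived ~ pclass + sex + age + sibsp,
                 data    = titanic,
                 method  = "class",                     # classification tree
                 control = rpart.control(cp = 0.02))    # penalize tree size to limit overfitting

    print(fit)              # the splits and the group-level predictions
    plot(fit); text(fit)    # draw the tree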
Loss and Regularization

You choose the splits to minimize the loss function. To regularize, you penalize based on the size of the tree. (This is very important: decision trees are very prone to overfitting.) Doing this exactly is hard, so algorithms use heuristics.

Pros and Cons of Decision Trees

Pros: Can in principle fit an arbitrary nonlinear relationship. Very easy to interpret the output.
Cons: An arbitrary nonlinear relationship can require an arbitrarily large tree. Very sensitive to outliers: one observation can change an early split, which makes all of the later splits completely different.

Ensemble Methods

Ensemble methods combine the results from multiple supervised learning techniques into one. Weaknesses in the individual methods can cancel out when they are combined. How the individual techniques are combined depends on the type of problem. For example: regression – average the predictions together; classification – take a majority vote.
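A minimal sketch of these two combination rules, assuming a matrix preds with one row per observation and one column per learner's prediction:

    # Combining an ensemble's predictions: `preds` is an assumed matrix with
    # one row per observation and one column per learner.

    # Regression: average the predictions together.
    combined_reg <- rowMeans(preds)

    # Classification (predictions stored as class labels): majority vote per row.
    combined_class <- apply(preds, 1, function(p) names(which.max(table(p))))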
Types of Ensemble Methods

We consider two main types of ensemble methods: bagging and boosting. Decision trees are frequently used as the raw methods to combine. We refer to each individual method as a “learner”.
Bagging

Bootstrap aggregation (bagging) is a bootstrap-based method. The idea is very simple: start with N different learners; for each learner, choose a random sample from the data and apply the learner; then combine the results.

Random Forests

A refined version of this idea applied to decision trees is called a random forest. The model consists of a series of trees, and each individual tree is fitted as follows: at each split point, randomly throw out some of the variables and some of the data, and choose the split on that subset of the data. At each new split point, rerandomize. Any mistakes you make from randomizing get washed out when you average the trees.
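A minimal sketch of bagging regression trees by hand, followed by the off-the-shelf randomForest call; the data frames df and df_new and the numeric response column y are assumptions for illustration:

    # Bagging by hand: fit one tree per bootstrap sample, then average.
    library(rpart)
    library(randomForest)

    set.seed(1)
    B <- 100
    trees <- vector("list", B)
    for (b in 1:B) {
      boot <- df[sample(nrow(df), replace = TRUE), ]   # bootstrap sample of the data
      trees[[b]] <- rpart(y ~ ., data = boot)          # fit one learner per sample
    }
    # Combine: average the B tree predictions for each new observation
    pred_bag <- rowMeans(sapply(trees, predict, newdata = df_new))

    # A random forest additionally resamples the candidate variables at each split
    rf <- randomForest(y ~ ., data = df, ntree = 500)
    pred_rf <- predict(rf, newdata = df_new)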
Regularizing Random Forests

Random forests are mostly self-regularizing. The fact that you are always working with a subset of the data helps prevent overfitting, and averaging the models cancels out accidental overfits, even if you use big trees.

Boosting

Boosting results from a famous theorem in computer science. A weak learner is a learner that is only slightly better than random guessing. Is there an algorithm to turn a weak learner into a strong learner? Yes. Any technique that does so is known as boosting. Bagging reduces variance (by averaging); boosting reduces bias.
Boosting, cont’d

How does boosting accomplish this trick? Start with a learner. Look at the mistakes that learner makes. Make a new learner that diagnoses the mistakes of the first learner. Look at the mistakes that the first two learners make. Repeat.

AdaBoost

AdaBoost is a simple classification algorithm that turns weak learners into strong learners by iteratively reweighting the sample, so each subsequent learner pays more attention to the cases that the earlier learners found hard to classify. The weak learners can even be “decision stumps” – decision trees with only one split.

AdaBoost Algorithm

Each observation has a weight, as does each learner. The observation weights start out equal. The algorithm works as follows:
1. Fit the m-th learner on the weighted data and compute the predictions it makes.
2. Compute how many mistakes it makes; observations with more weight count for more.
3. Choose the weight of the learner so that a learner which makes few mistakes gets a high weight.
4. Update the observation weights so that the observations it has trouble with count for more.
The final predictor is the weighted sum of the individual learners.
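A minimal from-scratch sketch of these steps (the AdaBoost.M1 variant) with rpart decision stumps; the data frame train, a factor response y with levels -1 and 1, and the new data passed to the prediction function are assumptions for illustration:

    # A from-scratch AdaBoost.M1 sketch with one-split trees as weak learners.
    library(rpart)

    adaboost <- function(train, M = 50) {
      n  <- nrow(train)
      yv <- as.numeric(as.character(train$y))       # labels as -1/+1
      w  <- rep(1 / n, n)                           # observation weights start equal
      stumps <- vector("list", M)
      alpha  <- numeric(M)
      for (m in 1:M) {
        # 1. Fit the m-th learner (a stump) on the weighted data
        stumps[[m]] <- rpart(y ~ ., data = train, weights = w, method = "class",
                             control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
        pred <- as.numeric(as.character(predict(stumps[[m]], train, type = "class")))
        miss <- as.numeric(pred != yv)
        # 2. Weighted error rate: heavier observations count for more
        err <- sum(w * miss) / sum(w)
        # 3. Learner weight: few mistakes => high weight
        alpha[m] <- log((1 - err) / err)
        # 4. Upweight the observations this learner got wrong
        w <- w * exp(alpha[m] * miss)
        w <- w / sum(w)
      }
      list(stumps = stumps, alpha = alpha)
    }

    # Final predictor: weighted vote (sign of the weighted sum) of the stumps
    predict_adaboost <- function(fit, newdata) {
      scores <- sapply(seq_along(fit$stumps), function(m) {
        fit$alpha[m] *
          as.numeric(as.character(predict(fit$stumps[[m]], newdata, type = "class")))
      })
      sign(rowSums(scores))
    }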
Why Does AdaBoost Work?

The steps in the algorithm do two things: data points that are hard to classify are given more weight, and learners that do well count for more.
Gradient Boosting

Gradient boosting (or the gradient boosting machine) is the leading general-purpose machine learning algorithm. For example, most winning entries on Kaggle – a machine learning competition site – use gradient boosting. AdaBoost is restricted to classification; gradient boosting is suitable for combining arbitrary learners.
The idea is simple: use the first learner to predict the data; compute the “residuals” from the first learner, and use the second learner to predict the residuals; and so on.
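A minimal sketch of this residual-fitting loop for squared-error loss; the data frames df and df_new are assumptions for illustration, and the learning rate nu is standard in practice even though it is not discussed above:

    # Gradient boosting by hand for squared-error loss: each small tree is fit
    # to the residuals of the model built so far.
    library(rpart)

    M  <- 200                             # number of small trees
    nu <- 0.1                             # learning rate (shrinkage)
    f  <- rep(mean(df$y), nrow(df))       # start from a constant prediction
    trees <- vector("list", M)

    for (m in 1:M) {
      resid <- df$y - f                   # residuals of the current model
      dat   <- df; dat$y <- resid
      trees[[m]] <- rpart(y ~ ., data = dat,
                          control = rpart.control(maxdepth = 1))  # a stump fit to the residuals
      f <- f + nu * predict(trees[[m]], df)    # take a small step toward the new tree
    }

    # Prediction on new data: the constant plus the sum of all (shrunken) tree predictions
    pred <- mean(df$y) + nu * rowSums(sapply(trees, predict, newdata = df_new))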
Output of gradient boosting

You fit a sequence of (small) decision trees: the first tree fits the data, and the second, third, etc. trees each fit the error left by the trees fitted so far. The final model combines all of the trees, as in AdaBoost.
Putting the “gradient” in gradient boosting

The challenge is to define a “residual” for each possible loss function. Gradient boosting uses the gradient of the loss function on the data – this points in the direction of best improvement. (For regression with squared-error loss, this is exactly the residuals returned by R.)
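To see why the two coincide for regression, take the squared-error loss and differentiate with respect to the model's prediction (a brief check of the claim, in the usual notation where f(x_i) is the current prediction for observation i):

    L(y_i, f(x_i)) = (1/2) (y_i - f(x_i))^2
    -dL/df(x_i)    = y_i - f(x_i)

So the negative gradient of the loss at each data point is exactly the ordinary residual, and fitting a tree to the gradient generalizes fitting a tree to the residuals.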
Gradient Boosting Algorithm

Gradient boosting can be viewed as a form of gradient descent. You have a measure of predictive accuracy, called the loss. At each step you compute the gradient of the loss and fit a tree to that gradient. You then take a step in the direction of the tree. The final model is a sum of tree predictions.

Regularization of Boosting

Boosting can overfit. Each individual tree tends to be small (gradient-boosted trees use stumps by default), but at the same time each new tree tries to fit more and more hard-to-match aspects of the data. The main way to regularize is to cut back on the number of trees. This makes regularization easy to implement – just look at what happens when you throw out the later trees.
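A minimal sketch of this with the gbm package, using cross-validation to decide how many of the later trees to throw out; the data frames df and df_new are assumptions for illustration:

    # Gradient boosting with gbm, regularized by choosing the number of trees.
    library(gbm)

    fit <- gbm(y ~ ., data = df,
               distribution = "gaussian",   # squared-error loss
               n.trees = 5000,              # fit many small trees...
               interaction.depth = 1,       # ...each one a stump
               shrinkage = 0.01,
               cv.folds = 5)

    best <- gbm.perf(fit, method = "cv")    # number of trees with the lowest CV error

    # Regularize by throwing out the later trees: predict with only `best` of them
    pred <- predict(fit, newdata = df_new, n.trees = best)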
Boosting Implementations

There are several gradient boosting implementations, which are worth comparing:
gbm – the original implementation.
xgboost – the most high-profile implementation.
lightgbm – by Microsoft.
catboost – specialized to classification problems; originally by Yandex.
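For concreteness, a minimal sketch with xgboost's classic matrix interface (argument names have shifted across package versions, so treat this as indicative); the predictor matrix X, response vector y, and new data X_new are assumptions for illustration:

    # Gradient boosting with xgboost on a numeric predictor matrix.
    library(xgboost)

    fit <- xgboost(data = as.matrix(X), label = y,
                   nrounds = 200,                   # number of boosting rounds (trees)
                   max_depth = 3, eta = 0.1,        # small trees, modest learning rate
                   objective = "reg:squarederror",
                   verbose = 0)

    pred <- predict(fit, as.matrix(X_new))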
Pros and Cons of Bagging and Boosting

Pros: They fit complex data much better.
Cons: While we can understand the individual trees, the whole model is an opaque black box.