ML UNIT-2 Notes
Unit-2
Decision Tree Learning: Representing concepts as decision trees, Recursive induction of decision trees,
Picking the best splitting attribute: entropy and information gain, Searching for simple trees and
computational complexity, Occam's razor, Overfitting, noisy data, and pruning. Experimental Evaluation of
Learning Algorithms: Measuring the accuracy of learned hypotheses. Comparing learning algorithms: cross-
validation, learning curves, and statistical hypothesis testing.
FIGURE : A decision tree for the concept PlayTennis. An example is classified by sorting it through the tree
to the appropriate leaf node, then returning the classification associated with this leaf (in this case, Yes or
No).
Test Sample: This decision tree classifies Saturday mornings according to whether they are suitable for
playing tennis. For example, the instance
<Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong>
would be sorted down the leftmost branch of this decision tree and would therefore be classified as a
negative instance (i.e., the tree predicts that PlayTennis = no).
In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of
instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree
itself to a disjunction of these conjunctions.
(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
i.e., the tree predicts PlayTennis = Yes exactly when this disjunction is satisfied.
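To make the representation concrete, here is a minimal Python sketch (not part of the original notes) that encodes the PlayTennis tree above as nested conditionals; the function name play_tennis is purely illustrative.

def play_tennis(outlook, humidity, wind):
    """Classify a Saturday morning exactly as the decision tree above would."""
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"

# The test sample from above, <Sunny, Hot, High, Strong>, sorts down the leftmost branch:
print(play_tennis("Sunny", "High", "Strong"))   # -> "No"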
Appropriate Problems for Decision Tree Learning:
• Instances are represented by attribute-value pairs
• The target function has discrete output values
• Disjunctive descriptions may be required
• The training data may contain errors
• The training data may contain missing attribute values.
Recursive induction of decision trees (ID3 algorithm)
ENTROPY MEASURES HOMOGENEITY OF EXAMPLES
Let us assume for the moment that we are dealing with Boolean features, so a split partitions D into D1 and D2. Let us also assume we have two classes, and denote by D1+ and D1- the positives and negatives in D1 (and likewise D2+ and D2- for D2).
The question is how to assess the utility of a feature in terms of splitting the examples into positives and negatives.
The best situation is where D1 receives all of the positives and D2 all of the negatives, or vice versa (e.g., D1+ = D+ and D1- = ∅). In that case, the two children of the split are said to be pure.
To measure the impurity of a set containing both positives and negatives, impurity can be defined in terms of the proportion of positives in the set, p = |D+| / (|D+| + |D-|).
Entropy:
Entropy is a common way to measure impurity. In decision tree learning, it measures the randomness or impurity in a data set.
Let S be the sample set of training examples. Then Entropy(S), measuring the impurity of S, is defined as

Entropy(S) = - Σ_{i=1..c} p_i · log2(p_i)

where c is the number of different class labels and p_i is the proportion of examples in S that fall into the i-th class.
To illustrate, suppose S is a collection of 14 examples of some Boolean concept, including 9 positive and 5 negative examples (we adopt the notation [9+, 5-] to summarize such a sample of data). Then the entropy of S relative to this Boolean classification is

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Notice that:
– The entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p+ = 1), then p- = 0, and Entropy(S) = -1·log2(1) - 0·log2(0) = 0.
– The entropy is 1 when the collection contains an equal number of positive and negative examples.
– If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1.
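The entropy values above can be checked with a small Python sketch (assuming the two-class case discussed here; the helper name entropy is illustrative):

import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))     # [9+, 5-]      -> approximately 0.940
print(entropy(7, 7))     # equal split   -> 1.0
print(entropy(14, 0))    # pure set      -> 0.0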
Given entropy as a measure of the impurity in a collection of training examples, we can now define a
measure of the effectiveness of an attribute in classifying the training data. The measure we will use, called
information gain, is simply the expected reduction in entropy caused by partitioning the examples according
to this attribute.
The information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}).
Information Gain:
The information gain is based on the decrease in entropy after a data set is split on a particular attribute A.
Information gain for an attribute A is the difference between the entropy of S before the split and the weighted average entropy of the subsets after the split.
Information gain thus refers to the decline in entropy after the dataset is split; it is also called entropy reduction. Building a decision tree is all about discovering the attributes that return the highest information gain.
For example, suppose S is a collection of training-example days described by attributes including Wind, which can have the values Weak or Strong. As before, assume S is a collection containing 14 examples, [9+, 5-]. Of these 14 examples, suppose 6 of the positive and 2 of the negative examples have Wind = Weak, and the remainder have Wind = Strong. The information gain due to sorting the original 14 examples by the attribute Wind may then be calculated as

Values(Wind) = {Weak, Strong}
S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]

Gain(S, Wind) = Entropy(S) - (|S_Weak|/|S|)·Entropy(S_Weak) - (|S_Strong|/|S|)·Entropy(S_Strong)
             = 0.940 - (8/14)·0.811 - (6/14)·1.00
             = 0.048

Note: S_Weak denotes the subset of S covered by the value Weak of the attribute Wind (and similarly for S_Strong).
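The same Wind calculation, written as a short Python sketch (the helper names entropy and information_gain are illustrative, not from any particular library):

import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

def information_gain(parent, children):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)."""
    total = sum(parent)
    return entropy(*parent) - sum(
        (p + n) / total * entropy(p, n) for p, n in children)

# S = [9+, 5-]; Wind splits it into S_Weak = [6+, 2-] and S_Strong = [3+, 3-]
print(information_gain((9, 5), [(6, 2), (3, 3)]))    # approximately 0.048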
Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the
tree. The use of information gain to evaluate the relevance of attributes is summarized in below Figure. In
this figure the information gain of two different attributes, Humidity and Wind, is computed in order to
determine which is the better attribute for classifying the training examples shown in below Table.
Fig:Humidity provides greater information gain than Wind, relative to the target classification. Here, E
stands for entropy and S for the original collection of examples. Given an initial collection S of 9 positive
and 5 negative examples, [9+, 5-], sorting these by their Humidity produces collections of [3+, 4-] (Humidity
= High) and [6+, 1-] (Humidity = Normal). The information gained by this partitioning is .151, compared to
a gain of only .048 for the attribute Wind.
TABLE: Training examples for the target concept PlayTennis.
Day   Outlook    Temperature  Humidity  Wind    PlayTennis
D1    Sunny      Hot          High      Weak    No
D2    Sunny      Hot          High      Strong  No
D3    Overcast   Hot          High      Weak    Yes
D4    Rain       Mild         High      Weak    Yes
D5    Rain       Cool         Normal    Weak    Yes
D6    Rain       Cool         Normal    Strong  No
D7    Overcast   Cool         Normal    Strong  Yes
D8    Sunny      Mild         High      Weak    No
D9    Sunny      Cool         Normal    Weak    Yes
D10   Rain       Mild         Normal    Weak    Yes
D11   Sunny      Mild         Normal    Strong  Yes
D12   Overcast   Mild         High      Strong  Yes
D13   Overcast   Hot          Normal    Weak    Yes
D14   Rain       Mild         High      Strong  No
The information gain values for all four attributes are:
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Outlook, having the highest information gain, is therefore selected as the decision attribute for the root node.
FIGURE:The partially learned decision tree resulting from the first step of ID3. The training examples are
sorted to the corresponding descendant nodes. The Overcast descendant has only positive examples and
therefore becomes a leaf node with classification Yes. The other two nodes will be further expanded, by
selecting the attribute with highest information gain relative to the new subsets of examples.
ALGORITHM:
ID3(Examples, Target_attribute, Attributes)
// Examples are the training examples. Target_attribute is the attribute whose value is to be
// predicted by the tree. Attributes is a list of other attributes that may be tested by the learned
// decision tree. Returns a decision tree that correctly classifies the given Examples.
Create a Root node for the tree
If all Examples are positive, Return the single-node tree Root, with label = +
If all Examples are negative, Return the single-node tree Root, with label = -
If Attributes is empty, Return the single-node tree Root, with label = most common value of Target_attribute in Examples
Otherwise Begin
  o A ← the attribute from Attributes that best* classifies Examples
  o The decision attribute for Root ← A
  o For each possible value, vi, of A:
      Add a new tree branch below Root, corresponding to the test A = vi
      Let Examples_vi be the subset of Examples that have value vi for A
      If Examples_vi is empty
        Then below this new branch add a leaf node with label = most common value of Target_attribute in Examples
        Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A})
End
Return Root
Note: The best attribute is the one with highest information gain
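The pseudocode can be turned into a compact Python sketch along the following lines. The dict-based tree representation and the helper names (entropy, gain, id3) are illustrative choices, not prescribed by the notes; unlike the pseudocode, this version only branches on attribute values actually seen in the examples, so the empty-subset case never arises.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(examples, target, attr):
    """Information gain of splitting `examples` (a list of dicts) on `attr`."""
    labels = [e[target] for e in examples]
    remainder = 0.0
    for v in set(e[attr] for e in examples):
        subset = [e[target] for e in examples if e[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                        # all positive or all negative
        return labels[0]
    if not attributes:                               # no attributes left to test
        return Counter(labels).most_common(1)[0][0]  # most common target value
    best = max(attributes, key=lambda a: gain(examples, target, a))
    tree = {best: {}}
    for v in set(e[best] for e in examples):         # only values seen in the data
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attributes if a != best])
    return tree

# Example usage, assuming the PlayTennis table is loaded as a list of dicts:
# tree = id3(examples, "PlayTennis", ["Outlook", "Temperature", "Humidity", "Wind"])
# This should reproduce the tree shown in the figure referred to below, with Outlook at the root.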
The final decision tree learned by ID3 from the 14 training examples of PlayTennis is shown in Figure:
HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
• ID3 searches the space of possible decision trees, performing a greedy hill-climbing search guided by information gain.
• It searches a complete hypothesis space: every finite discrete-valued function can be represented by at least one decision tree.
• It maintains only a single current hypothesis (unlike Candidate-Elimination), so it cannot tell us how many other viable hypotheses are consistent with the data.
• It uses all training examples at each step when deciding how to refine the tree, so the results are far less sensitive to errors in individual training examples.
Occam’s razor
– Prefer the simplest hypothesis that fits the data
– Argument in favor
• Fewer short hypotheses than long hypotheses
– Argument opposed
• There are many ways to define small sets of hypotheses
• What’s so special about small sets based on size of hypothesis?
Issues in Decision Tree Learning
• Determine how deeply to grow the decision tree
• Handling continuous attributes
• Choosing an appropriate attribute selection measure
• Handling training data with missing attribute values
• Handling attributes with differing costs
• Improving computational efficiency
Overfitting
Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples,
i.e., error_train(h) < error_train(h'),
but h' has a smaller error than h over the entire distribution of instances,
i.e., error_D(h') < error_D(h).
Reduced-error pruning
– Split data into training set, validation set used for pruning, and test set for measuring accuracy over
future unseen examples.
– Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (together with the subtree below it).
2. Greedily remove the node whose removal most improves validation set accuracy.
– Produces the smallest version of the most accurate subtree (a rough sketch of this procedure appears after this list).
– What if data is limited?
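As a rough illustration only, here is a simplified bottom-up sketch of reduced-error pruning over the dict-based trees produced by the id3() sketch above. It replaces a subtree with a leaf whenever doing so does not hurt accuracy on the validation examples that reach that node; the greedy loop described above (re-evaluating every node after each single removal) would be slightly more involved.

from collections import Counter

def classify(tree, example):
    """Follow branches until a leaf label is reached (None for unseen values)."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(example[attr])
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def reduced_error_prune(tree, train, valid, target):
    if not isinstance(tree, dict):
        return tree                                   # already a leaf
    attr = next(iter(tree))
    # Prune the children first, each against the examples reaching its branch.
    for v, sub in list(tree[attr].items()):
        sub_train = [e for e in train if e[attr] == v]
        sub_valid = [e for e in valid if e[attr] == v]
        tree[attr][v] = reduced_error_prune(sub, sub_train, sub_valid, target)
    # Then try replacing this node with the most common training label below it.
    leaf = Counter(e[target] for e in train).most_common(1)[0][0]
    if valid and accuracy(leaf, valid, target) >= accuracy(tree, valid, target):
        return leaf
    return tree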
• Rule post-pruning
– Most frequently used method (e.g., in C4.5)
1. Convert tree to equivalent set of rules
2. Prune each rule independently of others
3. Sort final rules into desired sequence for use
– In C4.5, evaluation of performance is based on the training set itself, using pessimistic estimate :
• Calculate the rule accuracy over the training set
• Calculate the standard deviation in this estimated accuracy, assuming a binomial distribution.
• The lower-bound estimate (accuracy minus a confidence-level multiple of that standard deviation) is taken as the measure of rule performance; a small sketch of this estimate is given after this section's bullets.
• Converting a tree to rules
Advantages of rule post-pruning over reduced-error Pruning
• Allows distinguishing among the different contexts in which a decision node is used.
• Removes the distinction between attributes near the root and those near the leaves.
• Improves readability
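A hedged Python sketch of the pessimistic estimate described above: the rule's accuracy over the training examples it covers, minus a confidence-level multiple of its binomial standard deviation. The factor 1.96 (roughly a 95% two-sided level) and the function name are illustrative assumptions; C4.5's actual error-based estimate differs in its details.

import math

def pessimistic_accuracy(correct, covered, z=1.96):
    """Lower-bound estimate of a rule's true accuracy."""
    acc = correct / covered
    std = math.sqrt(acc * (1 - acc) / covered)
    return acc - z * std

# A rule that covers 20 training examples and classifies 18 of them correctly:
print(pessimistic_accuracy(18, 20))    # 0.90 - 1.96 * 0.067 -> approximately 0.77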
• Incorporating continuous-valued attributes
– Define candidate thresholds midway between adjacent attribute values at which the target classification changes: e.g., Temperature > 54 and Temperature > 85.
– Pick the threshold, c, that produces the greatest information gain (here Temperature > 54); a small sketch follows below.
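A minimal sketch of the threshold-picking step, assuming a small Temperature sample of the kind these notes allude to: candidate thresholds sit midway between adjacent sorted values at which the classification changes, and the one with the highest information gain is kept.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best class-change boundary."""
    pairs = sorted(zip(values, labels))
    base, best = entropy(labels), None
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2:                                  # class label changes here
            t = (v1 + v2) / 2                         # candidate threshold
            left = [y for v, y in pairs if v <= t]
            right = [y for v, y in pairs if v > t]
            g = (base
                 - len(left) / len(pairs) * entropy(left)
                 - len(right) / len(pairs) * entropy(right))
            if best is None or g > best[1]:
                best = (t, g)
    return best

temperature = [40, 48, 60, 72, 80, 90]                # assumed example values
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temperature, play_tennis))       # -> threshold 54.0: Temperature > 54 beats > 85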
• Attributes with many values
– Problem
• If an attribute has many distinct values, Gain will tend to select it
• Imagine using Date = Oct_13_2004 as an attribute: it splits the training examples into many tiny, pure subsets, so its information gain is very high even though it generalizes poorly
– One approach: use GainRatio instead (a short sketch follows below)
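A short sketch of GainRatio as it is usually defined (information gain divided by SplitInformation), which penalizes attributes such as Date that split the data into many small subsets; the helper names are illustrative.

import math

def split_information(subset_sizes):
    """SplitInformation(S, A) = -sum over i of (|Si| / |S|) * log2(|Si| / |S|)."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)

# Wind splits the 14 examples into subsets of sizes 8 and 6; Gain(S, Wind) ~ 0.048
print(gain_ratio(0.048, [8, 6]))       # approximately 0.049
# A Date-like attribute splits them into 14 singletons; even with Gain ~ 0.940,
# its SplitInformation is log2(14) ~ 3.81, so its GainRatio is only about 0.25.
print(gain_ratio(0.940, [1] * 14))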