
Data Mining using Decision Trees

Professor J. F. Baldwin
Decision Trees from Data Base

Ex Num   Size    Colour   Shape    Concept Satisfied

1        med     blue     brick    yes
2        small   red      wedge    no
3        small   red      sphere   yes
4        large   red      wedge    no
5        large   green    pillar   yes
6        large   red      pillar   no
7        large   green    sphere   yes

Choose target: Concept Satisfied
Use all attributes except Ex Num
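For use in later sketches, the table can be written as a small Python data set; the list name `examples` and the dictionary keys are illustrative choices mirroring the slide.

```python
# The seven training examples from the table above.
# Keys mirror the slide's attribute names; "Concept" is the target.
examples = [
    {"Ex": 1, "Size": "med",   "Colour": "blue",  "Shape": "brick",  "Concept": "yes"},
    {"Ex": 2, "Size": "small", "Colour": "red",   "Shape": "wedge",  "Concept": "no"},
    {"Ex": 3, "Size": "small", "Colour": "red",   "Shape": "sphere", "Concept": "yes"},
    {"Ex": 4, "Size": "large", "Colour": "red",   "Shape": "wedge",  "Concept": "no"},
    {"Ex": 5, "Size": "large", "Colour": "green", "Shape": "pillar", "Concept": "yes"},
    {"Ex": 6, "Size": "large", "Colour": "red",   "Shape": "pillar", "Concept": "no"},
    {"Ex": 7, "Size": "large", "Colour": "green", "Shape": "sphere", "Concept": "yes"},
]
```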
CLS - Concept Learning System - Hunt et al.

Tree structure: a parent node holding a mixture of +ve and -ve examples
is split on an attribute V; each value v1, v2, v3 of V gives a child node.
CLS ALGORITHM
1. Initialise the tree T by setting it to consist of one node containing
   all the examples, both +ve and -ve, in the training set.
2. If all the examples in T are +ve, create a YES node and HALT.
3. If all the examples in T are -ve, create a NO node and HALT.
4. Otherwise, select an attribute F with values v1, ..., vn.
   Partition T into subsets T1, ..., Tn according to the values of F.
   Create branches with F as parent and T1, ..., Tn as child nodes.
5. Apply the procedure recursively to each child node.
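A minimal sketch of these steps, assuming the `examples` list from the earlier sketch; it takes the attributes in whatever order it is given and assumes they always suffice to separate the classes (true for this data). The name `cls` and the dict-based tree are illustrative choices.

```python
def cls(examples, attributes):
    """Build a decision tree by the CLS steps above (no attribute ordering)."""
    labels = {e["Concept"] for e in examples}
    if labels == {"yes"}:
        return "YES"                      # step 2: all +ve
    if labels == {"no"}:
        return "NO"                       # step 3: all -ve
    f = attributes[0]                     # step 4: attributes taken in the given order
    rest = attributes[1:]
    tree = {"attribute": f, "branches": {}}
    for value in {e[f] for e in examples}:
        subset = [e for e in examples if e[f] == value]
        tree["branches"][value] = cls(subset, rest)   # step 5: recurse on each child
    return tree

# e.g. tree = cls(examples, ["Size", "Colour", "Shape"])
```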


Data Base Example
Using attribute SIZE

{1, 2, 3, 4, 5, 6, 7} split on SIZE:
  med   -> {1}            YES
  small -> {2, 3}         Expand
  large -> {4, 5, 6, 7}   Expand


Expanding

{1, 2, 3, 4, 5, 6, 7} split on SIZE:
  med   -> {1}  YES
  small -> {2, 3} split on SHAPE:
             wedge  -> {2}  NO
             sphere -> {3}  YES
  large -> {4, 5, 6, 7} split on SHAPE:
             wedge  -> {4}  NO
             pillar -> {5, 6} split on COLOUR:
                         red   -> {6}  NO
                         green -> {5}  YES
             sphere -> {7}  YES
Rules from Tree

IF (SIZE = large AND ((SHAPE = wedge) OR (SHAPE = pillar AND COLOUR = red)))
   OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR (SHAPE = sphere)))
   OR (SIZE = small AND SHAPE = sphere)
   OR (SIZE = medium)
THEN YES
Disjunctive Normal Form - DNF

IF (SIZE = medium)
   OR (SIZE = small AND SHAPE = sphere)
   OR (SIZE = large AND SHAPE = sphere)
   OR (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT = satisfied
ELSE CONCEPT = not satisfied
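The DNF rule translates directly into a predicate; a minimal sketch (the function name is an illustrative choice, and the value "med" is used for medium, as in the data table):

```python
def concept_satisfied(size, colour, shape):
    """DNF rule read off the CLS tree above."""
    return (size == "med"
            or (size == "small" and shape == "sphere")
            or (size == "large" and shape == "sphere")
            or (size == "large" and shape == "pillar" and colour == "green"))

# e.g. concept_satisfied("large", "green", "pillar") -> True  (example 5)
#      concept_satisfied("large", "red", "pillar")   -> False (example 6)
```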


ID3 - Quinlan

In the CLS algorithm the attributes can be chosen in any order. This can
result in large decision trees if the ordering is not optimal; an optimal
ordering would give the smallest decision tree.

No method is known for determining the optimal ordering, so we use a
heuristic that provides an efficient, near-optimal ordering.

ID3 = CLS + efficient ordering of attributes

Entropy is used to order the attributes.


Entropy

For a random variable V which can take values {v1, v2, ..., vn} with
Pr(vi) = pi for all i, the entropy of V is given by

    S(V) = − Σ_{i=1..n} pi ln(pi)

Entropy of a fair die:
    S = − Σ_{i=1..6} (1/6) ln(1/6) = ln(6) = 1.7917

Entropy of a fair die known to show an even score:
    S = − Σ_{i=1..3} (1/3) ln(1/3) = ln(3) = 1.0986

Information gain = 1.7917 − 1.0986 = 0.6931, the difference between the
two entropies.
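A quick numerical check of the die example (a minimal sketch; `math.log` here is the natural logarithm used on this slide):

```python
import math

def entropy(probs):
    """S = -sum p_i ln(p_i) for a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(entropy([1/6] * 6))                          # 1.7917... = ln(6), fair die
print(entropy([1/3] * 3))                          # 1.0986... = ln(3), die known to be even
print(entropy([1/6] * 6) - entropy([1/3] * 3))     # 0.6931... information gain
```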
Attribute Expansion

Start with a probability table over all the attributes and the target,
Pr(A1, ..., Ai, ..., An, T), with rows equally likely unless specified.

To expand attribute Ai with values ai1, ..., aim: for each value aik,
form the table

    Pr(A1, ..., Ai-1, Ai+1, ..., An, T | Ai = aik)

by passing down the probabilities of the rows with Ai = aik and
re-normalising. The rows are equally likely again if they were equally
likely before.
Expected Entropy for an Attribute

For attribute Ai and target T, form the table Pr(Ai, T) by summing out
the other attributes:

    Pr(Ai, T) = Σ_{A1} ... Σ_{Ai-1} Σ_{Ai+1} ... Σ_{An} Pr(A1, ..., An, T)

For each value aij of Ai, pass down the probabilities of the target
values tk corresponding to aij, re-normalise to obtain Pr(T | Ai = aij),
and compute the entropy

    S(aij) = − Σ_k Pr(tk | aij) ln Pr(tk | aij)

With  Pr(Ai) = Σ_T Pr(Ai, T),  the expected entropy for Ai is

    S(Ai) = Σ_j Pr(aij) S(aij)
How to Choose an Attribute and Information Gain

Determine the expected entropy for each attribute, i.e. S(Ai) for all i.
Choose s such that  MIN_j {S(Aj)} = S(As)  and expand attribute As.

By choosing attribute As the information gain is

    S − S(As)

where  S = − Σ_T Pr(T) ln Pr(T)  and  Pr(T) = Σ_{A1} ... Σ_{An} Pr(A1, ..., An, T).

Minimising expected entropy is equivalent to maximising information gain.
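A sketch of this attribute-selection step, assuming the `examples` list from the earlier sketch and equally likely examples; the function names are illustrative. Base-2 logarithms are used to match the numbers in the worked example that follows (the choice of base does not change which attribute is selected).

```python
import math
from collections import Counter

def entropy2(labels):
    """S = -sum p log2(p) over the observed target values."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def expected_entropy(rows, attribute, target="Concept"):
    """S(Ai) = sum_j Pr(a_ij) S(a_ij) with equally likely rows."""
    total = len(rows)
    s = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        s += (len(subset) / total) * entropy2(subset)
    return s

def best_attribute(rows, attributes, target="Concept"):
    """Choose the attribute with minimum expected entropy (maximum information gain)."""
    return min(attributes, key=lambda a: expected_entropy(rows, a, target))

# On the example data this gives gains of about 0.13 (Size), 0.52 (Colour)
# and 0.70 (Shape), so Shape would be chosen first:
#   s0 = entropy2([r["Concept"] for r in examples])          # ~0.99
#   s0 - expected_entropy(examples, "Shape")                 # ~0.70
#   best_attribute(examples, ["Size", "Colour", "Shape"])    # "Shape"
```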
Previous Example

Ex Num   Size    Colour   Shape    Concept Satisfied   Pr

1        med     blue     brick    yes                 1/7
2        small   red      wedge    no                  1/7
3        small   red      sphere   yes                 1/7
4        large   red      wedge    no                  1/7
5        large   green    pillar   yes                 1/7
6        large   red      pillar   no                  1/7
7        large   green    sphere   yes                 1/7

Target table (Concept Satisfied):  yes 4/7,  no 3/7

    S = −[(4/7) log(4/7) + (3/7) log(3/7)] = 0.99   (logs to base 2 here)
Entropy for attribute Size

Size    Concept Satisfied   Pr

med     yes                 1/7
small   no                  1/7
small   yes                 1/7
large   no                  2/7
large   yes                 2/7

Pr(med) = 1/7,  Pr(small) = 2/7,  Pr(large) = 4/7

Conditional tables for the target:
  med:    yes 1            -> S(med) = 0
  small:  no 1/2, yes 1/2  -> S(small) = 1
  large:  no 1/2, yes 1/2  -> S(large) = 1

S(Size) = (2/7)·1 + (1/7)·0 + (4/7)·1 = 6/7 = 0.86
Information gain for Size = 0.99 − 0.86 = 0.13
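The same numbers checked directly, a small sketch using base-2 logarithms consistent with S = 0.99 above:

```python
import math

s_target = -(4/7) * math.log2(4/7) - (3/7) * math.log2(3/7)   # ~0.985 -> 0.99
s_size = (1/7) * 0 + (2/7) * 1 + (4/7) * 1                    # 6/7 ~ 0.857 -> 0.86
print(round(s_target, 2), round(s_size, 2), round(s_target - s_size, 2))
# 0.99 0.86 0.13  -> information gain for Size
```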
First Expansion

Attribute   Information Gain
SIZE        0.13
COLOUR      0.52
SHAPE       0.70   <- maximum, so choose SHAPE

{1, 2, 3, 4, 5, 6, 7} split on SHAPE:
  wedge  -> {2, 4}   NO
  brick  -> {1}      YES
  pillar -> {5, 6}   Expand
  sphere -> {3, 7}   YES


Complete Decision Tree

{1, 2, 3, 4, 5, 6, 7} split on SHAPE:
  wedge  -> {2, 4}  NO
  brick  -> {1}     YES
  sphere -> {3, 7}  YES
  pillar -> {5, 6} split on COLOUR:
              red   -> {6}  NO
              green -> {5}  YES

Rule:
IF (Shape is wedge) OR (Shape is pillar AND Colour is red)
THEN NO
ELSE YES
A new case

Size   Colour   Shape    Concept Satisfied
med    red      pillar   ?

Follow the tree: SHAPE = pillar, then COLOUR = red, so ? = NO
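The same look-up written as the rule from the completed tree (a sketch; the function name is illustrative, and Size is never tested by this tree):

```python
def id3_rule(size, colour, shape):
    """Rule read off the ID3 tree: wedge, or a red pillar, means NO."""
    if shape == "wedge" or (shape == "pillar" and colour == "red"):
        return "no"
    return "yes"

print(id3_rule("med", "red", "pillar"))   # "no" - the new case above
```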
Post Pruning

Consider any node S containing N examples, n of which belong to class C,
where C is the class with the most examples at S (the majority class,
one of {YES, NO}).

Suppose we terminate this node and make it a leaf with classification C.
What is the expected error, E(S), if we use the tree for new cases and
reach this node?

    E(S) = Pr(class of new case ≠ C)


Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S.
We do not know p. Let f(p) be a prior probability distribution for p on
[0, 1]. We can update this prior using Bayes' updating with the
information at node S. The information at node S is: n cases of class C in S.

    f(p | n C in S) = Pr(n C in S | p) f(p) / ∫₀¹ Pr(n C in S | p) f(p) dp
Mathematics of Post Pruning

Assume f(p) to be uniform over [0, 1]. Then

    f(p | n C in S) = p^n (1−p)^(N−n) / ∫₀¹ p^n (1−p)^(N−n) dp

The integrals are evaluated using Beta functions:

    ∫₀¹ x^a (1−x)^b dx = a! b! / (a + b + 1)!

for example  ∫₀¹ p^n (1−p)^(N−n+1) dp = n! (N − n + 1)! / (N + 2)!

The expected error is the posterior expectation of (1 − p):

    E(S) = E_{f(p | n C in S)} [1 − p]
         = ∫₀¹ p^n (1−p)^(N−n+1) dp / ∫₀¹ p^n (1−p)^(N−n) dp
         = (N − n + 1) / (N + 2)
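A one-line check of the resulting formula (a sketch; the function name is an illustrative choice):

```python
def expected_error(n, N):
    """E(S) = (N - n + 1) / (N + 2) for a node with N examples, n in the majority class."""
    return (N - n + 1) / (N + 2)

print(expected_error(6, 10))   # 0.4166... - the root node [6, 4] in the later example
print(expected_error(4, 6))    # 0.375    - the node [4, 2]
```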
Post Pruning for Binary Case

For a leaf node Si:    Error(Si) = E(Si)

For any non-leaf node S with children S1, ..., Sm reached with
probabilities

    Pi = (number of examples in Si) / (number of examples in S)

we can calculate

    BackUpError(S) = Σ_i Pi Error(Si)
    Error(S) = MIN { E(S), BackUpError(S) }

Decision: prune at S if BackUpError(S) ≥ E(S), i.e. if Error(S) = E(S).
Pruning means cutting the sub-tree below S and making S a leaf.
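A recursive sketch of this bottom-up pruning pass under the assumptions above; the node representation (dicts with "yes"/"no" counts and an optional "children" list) and the function names are illustrative choices, not from the slides.

```python
def expected_error(yes, no):
    """E(S) = (N - n + 1) / (N + 2), with n the majority count (binary case)."""
    n, N = max(yes, no), yes + no
    return (N - n + 1) / (N + 2)

def prune(node):
    """Return (Error(S), size) for node, cutting sub-trees where BackUpError >= E(S).

    A leaf is a dict {"yes": int, "no": int}; an internal node also has
    "children": [subnodes]. The node is modified in place when pruned.
    """
    e = expected_error(node["yes"], node["no"])
    size = node["yes"] + node["no"]
    if "children" not in node:
        return e, size                   # leaf: Error(Si) = E(Si)
    child_results = [prune(c) for c in node["children"]]
    backup = sum((s / size) * err for err, s in child_results)
    if backup >= e:                      # pruning does not increase expected error
        del node["children"]             # make S a leaf labelled with its majority class
        return e, size
    return backup, size                  # keep the sub-tree; Error(S) = BackUpError(S)
```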
Example of Post Pruning
Before Pruning [x, y]
0.417 a means x YES cases
0.378 [6, 4] and y NO cases
We underline Error(Sk)
0.375 b 0.5 c
0.413 [4, 2] PRUNE 0.383 [2, 2]

[1, 0]
[3, 2] [1, 0] d 0.333
0.4
0.429 0.333 0.444 [1, 2] PRUNE
PRUNE means cut the
[1, 1] [0, 1] sub- tree below this point
0.5 0.333
Result of Pruning

After pruning:

a [6, 4]
  b [4, 2]   (now a leaf)
  c [2, 2]
    [1, 0]
    d [1, 2]   (now a leaf)
Generalisation

For the case in which we have k classes, the generalisation of E(S) is

    E(S) = (N − n + k − 1) / (N + k)

Otherwise the pruning method is the same.
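The same check extended with k as a parameter (a sketch; k = 2 recovers the binary-case formula):

```python
def expected_error_k(n, N, k):
    """E(S) = (N - n + k - 1) / (N + k) for k classes, n in the majority class."""
    return (N - n + k - 1) / (N + k)

print(expected_error_k(4, 6, 2))   # 0.375, same as the binary case
print(expected_error_k(4, 6, 3))   # 0.444..., three classes
```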


Testing

Split the database into a Training Set and a Test Set.

Learn rules using the Training Set and prune. Test the rules on the
Training Set and record the % correct, then test the rules on the Test
Set and record the % correct.

The % accuracy on the test set should be close to that on the training
set; this indicates good generalisation.
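A sketch of that procedure; `build_and_prune` and `classify` stand for whatever learner and rule-application functions are used (illustrative parameters, not defined in the slides), and rows are assumed to look like the `examples` list defined earlier.

```python
import random

def accuracy(tree, rows, classify):
    """Fraction of rows whose predicted Concept matches the recorded one."""
    correct = sum(1 for r in rows if classify(tree, r) == r["Concept"])
    return correct / len(rows)

def train_test_evaluation(rows, build_and_prune, classify, train_fraction=0.7):
    """Split the data, learn and prune on the training part, compare accuracies."""
    rows = rows[:]                     # shuffle a copy, leave the original intact
    random.shuffle(rows)
    split = int(train_fraction * len(rows))
    training, test = rows[:split], rows[split:]
    tree = build_and_prune(training)
    return accuracy(tree, training, classify), accuracy(tree, test, classify)
```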

Over-fitting can occur if noisy data is used or too specific attributes
are used. Pruning will overcome noise to some extent but not completely.
Too specific attributes must be dropped.
