
ENGR 691/692 Section 66 (Fall 06): Machine Learning


Homework 1: Bayesian Decision Theory (solutions)

Assigned: August 30
Due: September 13

Problem 1: (22 pts) Let the conditional densities for a two-category one-dimensional problem be given by the following Cauchy distribution:
\[
p(x|\omega_i) = \frac{1}{\pi b} \cdot \frac{1}{1 + \left(\frac{x - a_i}{b}\right)^2}, \qquad i = 1, 2.
\]

1. (6 pts) By explicit integration, check that the distributions are indeed normalized.

2. (9 pts) Assuming $P(\omega_1) = P(\omega_2)$, show that $P(\omega_1|x) = P(\omega_2|x)$ if $x = \frac{a_1 + a_2}{2}$, that is, the minimum error decision boundary is a point midway between the peaks of the two distributions, regardless of $b$.

3. (7 pts) Show that the minimum probability of error is given by
\[
P(\text{error}) = \frac{1}{2} - \frac{1}{\pi} \tan^{-1}\left|\frac{a_1 - a_2}{2b}\right| .
\]
Answer:

1. We have
\[
\int_{-\infty}^{\infty} p(x|\omega_i)\,dx
= \frac{1}{\pi b} \int_{-\infty}^{\infty} \frac{1}{1 + \left(\frac{x - a_i}{b}\right)^2}\,dx .
\]
We substitute $y = \frac{x - a_i}{b}$ into the above and get
\[
\int_{-\infty}^{\infty} p(x|\omega_i)\,dx
= \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{1}{1 + y^2}\,dy
= \frac{1}{\pi} \Big[ \tan^{-1}(y) \Big]_{-\infty}^{\infty}
= \frac{1}{\pi} \left( \frac{\pi}{2} + \frac{\pi}{2} \right)
= 1 .
\]

2. By setting $p(x|\omega_1)P(\omega_1) = p(x|\omega_2)P(\omega_2)$ with $P(\omega_1) = P(\omega_2) = \frac{1}{2}$, we have
\[
\frac{1}{2} \cdot \frac{1}{\pi b} \cdot \frac{1}{1 + \left(\frac{x - a_1}{b}\right)^2}
= \frac{1}{2} \cdot \frac{1}{\pi b} \cdot \frac{1}{1 + \left(\frac{x - a_2}{b}\right)^2} ,
\]
or, equivalently,
\[
|x - a_1| = |x - a_2| .
\]
For $a_1 \neq a_2$, this implies that $x = \frac{a_1 + a_2}{2}$.

3. Without loss of generality, we assume $a_2 > a_1$. The probability of error is defined as
\[
P(\text{error}) = \int P(\text{error}, x)\,dx = \int P(\text{error}|x)\,p(x)\,dx .
\]
Note that the decision boundary is at $\frac{a_1 + a_2}{2}$, hence
\[
P(\text{error}|x) =
\begin{cases}
P(\omega_2|x) = \dfrac{p(x|\omega_2)P(\omega_2)}{p(x)} & \text{if } x \le \frac{a_1 + a_2}{2} , \\[2ex]
P(\omega_1|x) = \dfrac{p(x|\omega_1)P(\omega_1)}{p(x)} & \text{if } x > \frac{a_1 + a_2}{2} .
\end{cases}
\]
Therefore, the probability of error is
\begin{align*}
P(\text{error})
&= \int_{-\infty}^{\frac{a_1 + a_2}{2}} p(x|\omega_2)P(\omega_2)\,dx
 + \int_{\frac{a_1 + a_2}{2}}^{\infty} p(x|\omega_1)P(\omega_1)\,dx \\
&= \frac{1}{2\pi b} \int_{-\infty}^{\frac{a_1 + a_2}{2}} \frac{dx}{1 + \left(\frac{x - a_2}{b}\right)^2}
 + \frac{1}{2\pi b} \int_{\frac{a_1 + a_2}{2}}^{\infty} \frac{dx}{1 + \left(\frac{x - a_1}{b}\right)^2} .
\end{align*}
We substitute $y = \frac{x - a_2}{b}$ and $z = \frac{x - a_1}{b}$ into the above and get
\begin{align*}
P(\text{error})
&= \frac{1}{2\pi} \int_{-\infty}^{\frac{a_1 - a_2}{2b}} \frac{1}{1 + y^2}\,dy
 + \frac{1}{2\pi} \int_{\frac{a_2 - a_1}{2b}}^{\infty} \frac{1}{1 + z^2}\,dz \\
&= \frac{1}{2\pi} \Big[ \tan^{-1}(y) \Big]_{-\infty}^{\frac{a_1 - a_2}{2b}}
 + \frac{1}{2\pi} \Big[ \tan^{-1}(z) \Big]_{\frac{a_2 - a_1}{2b}}^{\infty} \\
&= \frac{1}{2\pi} \left[ \tan^{-1}\!\left(\frac{a_1 - a_2}{2b}\right) + \frac{\pi}{2} \right]
 + \frac{1}{2\pi} \left[ \frac{\pi}{2} - \tan^{-1}\!\left(\frac{a_2 - a_1}{2b}\right) \right] \\
&= \frac{1}{2} - \frac{1}{\pi} \tan^{-1}\!\left(\frac{a_2 - a_1}{2b}\right) .
\end{align*}
Similarly, if $a_1 > a_2$, we have $P(\text{error}) = \frac{1}{2} - \frac{1}{\pi} \tan^{-1}\!\left(\frac{a_1 - a_2}{2b}\right)$.

Therefore, we have shown that
\[
P(\text{error}) = \frac{1}{2} - \frac{1}{\pi} \tan^{-1}\left|\frac{a_1 - a_2}{2b}\right| .
\]
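The closed form above can be sanity-checked numerically. The following is a small illustrative sketch, not part of the original solution; the parameter values and the use of scipy.integrate.quad are assumptions made purely for the check.

\begin{verbatim}
# Numerical check of the Problem 1 result (illustrative sketch only).
import numpy as np
from scipy.integrate import quad

a1, a2, b = 0.0, 3.0, 1.5        # assumed example parameters

def cauchy(x, a):
    # Cauchy class-conditional density with location a and half-width b.
    return 1.0 / (np.pi * b * (1.0 + ((x - a) / b) ** 2))

# With P(w1) = P(w2) = 1/2 and the boundary at (a1 + a2)/2, the error
# integrates the density of the class not chosen on each side of the boundary.
x0 = (a1 + a2) / 2.0
err_numeric = 0.5 * quad(cauchy, -np.inf, x0, args=(a2,))[0] \
            + 0.5 * quad(cauchy, x0, np.inf, args=(a1,))[0]

err_closed = 0.5 - np.arctan(abs(a1 - a2) / (2.0 * b)) / np.pi
print(err_numeric, err_closed)   # both ~0.25 for these parameters
\end{verbatim}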

Problem 2: (21 pts) Let $\omega_{\max}(x)$ be the state of nature for which $P(\omega_{\max}|x) \ge P(\omega_i|x)$ for all $i$, $i = 1, \ldots, c$.

1. (7 pts) Show that $P(\omega_{\max}|x) \ge \frac{1}{c}$.

2. (7 pts) Show that for the minimum-error-rate decision rule the average probability of error is given by
\[
P(\text{error}) = 1 - \int P(\omega_{\max}|x)\,p(x)\,dx .
\]

3. (7 pts) Show that $P(\text{error}) \le \frac{c-1}{c}$.

Answer:
1. Since $P(\omega_{\max}|x) \ge P(\omega_i|x)$ for all $i$, we have
\[
c\,P(\omega_{\max}|x) \ge \sum_{i=1}^{c} P(\omega_i|x) = 1 ,
\]
which implies that $P(\omega_{\max}|x) \ge \frac{1}{c}$.

2. By definition, the minimum-error-rate rule decides $\omega_{\max}(x)$ at every $x$, so $P(\text{error}|x) = 1 - P(\omega_{\max}|x)$ and
\[
P(\text{error}) = \int P(\text{error}|x)\,p(x)\,dx
= \int \left[ 1 - P(\omega_{\max}|x) \right] p(x)\,dx
= 1 - \int P(\omega_{\max}|x)\,p(x)\,dx .
\]

3. From parts 1 and 2, it is clear that
\[
P(\text{error}) = 1 - \int P(\omega_{\max}|x)\,p(x)\,dx
\le 1 - \int \frac{1}{c}\,p(x)\,dx
= 1 - \frac{1}{c} = \frac{c-1}{c} .
\]
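As a quick empirical sanity check of the bound in part 3 (again, not part of the original solution), one can draw arbitrary valid posterior vectors and confirm that the resulting average error never exceeds $(c-1)/c$; the Dirichlet sampling below is merely an assumed, convenient way to generate such vectors.

\begin{verbatim}
# Empirical sanity check of P(error) <= (c-1)/c (illustrative sketch only).
import numpy as np

rng = np.random.default_rng(0)
c, n = 4, 100_000
posteriors = rng.dirichlet(np.ones(c), size=n)   # each row sums to 1
# Minimum-error-rate rule: the conditional error at x is 1 - P(w_max | x).
avg_error = 1.0 - posteriors.max(axis=1).mean()
print(avg_error, (c - 1) / c)                    # first value never exceeds second
\end{verbatim}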

Problem 3: (22 pts) In many machine learning applications, one has the option either to assign the pattern to one of $c$ classes, or to reject it as being unrecognizable. If the cost for rejects is not too high, rejection may be a desirable action. Let
\[
\lambda(\alpha_i|\omega_j) =
\begin{cases}
0 & i = j , \quad i, j = 1, \ldots, c \\
\lambda_r & i = c + 1 \\
\lambda_s & \text{otherwise,}
\end{cases}
\]
where $\lambda_r$ is the loss incurred for choosing the $(c+1)$th action, rejection, and $\lambda_s$ is the loss incurred for making any substitution error.

1. (10 pts) Please derive the decision rule with the minimum risk.

2. (6 pts) What happens if $\lambda_r = 0$?

3. (6 pts) What happens if $\lambda_r > \lambda_s$?
Answer:
1. For $i = 1, \ldots, c$,
\[
R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j) P(\omega_j|x)
= \lambda_s \sum_{j=1,\, j \neq i}^{c} P(\omega_j|x)
= \lambda_s \left[ 1 - P(\omega_i|x) \right] .
\]
For $i = c + 1$,
\[
R(\alpha_{c+1}|x) = \lambda_r .
\]
Therefore, the minimum risk is achieved if we decide $\omega_i$ whenever $P(\omega_i|x) \ge P(\omega_j|x)$ for all $j$ and $R(\alpha_i|x) \le R(\alpha_{c+1}|x)$, i.e., $P(\omega_i|x) \ge 1 - \frac{\lambda_r}{\lambda_s}$, and reject otherwise. (A small code sketch of this rule appears after part 3 below.)

2. If $\lambda_r = 0$, we always reject, since rejection then incurs no loss.

3. If $\lambda_r > \lambda_s$, we will never reject, since some substitution is always cheaper than rejection.
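The decision rule from part 1 can be written in a few lines of code. The sketch below is illustrative only; the function name and example numbers are assumptions, not from the assignment.

\begin{verbatim}
# Classify-with-reject rule from Problem 3 part 1 (illustrative sketch only).
import numpy as np

def classify_with_reject(posteriors, lambda_r, lambda_s):
    """Return the index of the chosen class, or -1 to signal rejection."""
    i = int(np.argmax(posteriors))               # most probable class
    if posteriors[i] >= 1.0 - lambda_r / lambda_s:
        return i
    return -1                                    # rejecting costs less than any substitution

print(classify_with_reject([0.6, 0.3, 0.1],   lambda_r=0.5, lambda_s=1.0))  # -> 0
print(classify_with_reject([0.4, 0.35, 0.25], lambda_r=0.5, lambda_s=1.0))  # -> -1 (reject)
\end{verbatim}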
Problem 4: (12 pts + 10 extra points) Let the components of the vector $\mathbf{x} = [x_1, \ldots, x_d]^T$ be binary-valued (0 or 1), and let $P(\omega_j)$ be the prior probability for the state of nature $\omega_j$, $j = 1, \ldots, c$. We define
\[
p_{ij} = P(x_i = 1|\omega_j), \qquad i = 1, \ldots, d, \quad j = 1, \ldots, c,
\]
with the components $x_i$ being statistically independent for all $\mathbf{x}$ in $\omega_j$.

1. (12 pts) Show that the minimum probability of error is achieved by the following decision rule: Decide $\omega_k$ if $g_k(\mathbf{x}) \ge g_j(\mathbf{x})$ for all $j$ and $k$, where
\[
g_j(\mathbf{x}) = \sum_{i=1}^{d} x_i \ln \frac{p_{ij}}{1 - p_{ij}} + \sum_{i=1}^{d} \ln(1 - p_{ij}) + \ln P(\omega_j) .
\]

2. (10 extra pts) If the components of $\mathbf{x}$ are ternary-valued (1, 0, or $-1$), show that a minimum probability of error decision rule can be derived that involves discriminant functions $g_j(\mathbf{x})$ that are quadratic functions of the components $x_i$.

Answer:
1. Consider the following discriminant function:
\[
g_j(\mathbf{x}) = \ln \left[ p(\mathbf{x}|\omega_j) P(\omega_j) \right] = \ln p(\mathbf{x}|\omega_j) + \ln P(\omega_j) .
\]
Since the components of $\mathbf{x}$ are statistically independent for all $\mathbf{x}$ in $\omega_j$, we can write the density as a product:
\[
p(\mathbf{x}|\omega_j) = \prod_{i=1}^{d} p(x_i|\omega_j) = \prod_{i=1}^{d} p_{ij}^{x_i} (1 - p_{ij})^{1 - x_i} .
\]
Thus we have the discriminant function
\[
g_j(\mathbf{x}) = \sum_{i=1}^{d} \left[ x_i \ln p_{ij} + (1 - x_i) \ln(1 - p_{ij}) \right] + \ln P(\omega_j)
= \sum_{i=1}^{d} x_i \ln \frac{p_{ij}}{1 - p_{ij}} + \sum_{i=1}^{d} \ln(1 - p_{ij}) + \ln P(\omega_j) .
\]
(A numerical sketch of this rule appears after part 2.)

2. Consider the same discriminant function:
\[
g_j(\mathbf{x}) = \ln \left[ p(\mathbf{x}|\omega_j) P(\omega_j) \right] = \ln p(\mathbf{x}|\omega_j) + \ln P(\omega_j) .
\]
The components of $\mathbf{x}$ are statistically independent for all $\mathbf{x}$ in $\omega_j$, therefore
\[
p(\mathbf{x}|\omega_j) = \prod_{i=1}^{d} p(x_i|\omega_j) .
\]
Let
\[
p_{ij} = P(x_i = 1|\omega_j), \qquad
q_{ij} = P(x_i = 0|\omega_j), \qquad
r_{ij} = P(x_i = -1|\omega_j) .
\]
It is not hard to check that
\[
p(x_i|\omega_j) = p_{ij}^{\frac{1}{2}x_i + \frac{1}{2}x_i^2} \, q_{ij}^{1 - x_i^2} \, r_{ij}^{-\frac{1}{2}x_i + \frac{1}{2}x_i^2} ,
\]
since the exponents select $p_{ij}$, $q_{ij}$, or $r_{ij}$ when $x_i = 1$, $0$, or $-1$, respectively. Thus the discriminant functions can be written as
\begin{align*}
g_j(\mathbf{x})
&= \sum_{i=1}^{d} \left[ \left( \tfrac{1}{2}x_i + \tfrac{1}{2}x_i^2 \right) \ln p_{ij} + (1 - x_i^2) \ln q_{ij} + \left( -\tfrac{1}{2}x_i + \tfrac{1}{2}x_i^2 \right) \ln r_{ij} \right] + \ln P(\omega_j) \\
&= \frac{1}{2} \sum_{i=1}^{d} x_i^2 \ln \frac{p_{ij} r_{ij}}{q_{ij}^2}
 + \frac{1}{2} \sum_{i=1}^{d} x_i \ln \frac{p_{ij}}{r_{ij}}
 + \sum_{i=1}^{d} \ln q_{ij} + \ln P(\omega_j) ,
\end{align*}
which are quadratic functions of the components $x_i$.
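The following small numerical sketch evaluates the part 1 discriminant $g_j(\mathbf{x})$ for binary features; the probabilities, priors, and test vector are made-up illustrative values, not from the assignment.

\begin{verbatim}
# Discriminant g_j(x) for binary features, Problem 4 part 1 (illustrative sketch).
import numpy as np

def g(x, p_j, prior_j):
    # g_j(x) = sum_i x_i ln(p_ij/(1-p_ij)) + sum_i ln(1-p_ij) + ln P(w_j)
    x, p_j = np.asarray(x, float), np.asarray(p_j, float)
    return (x * np.log(p_j / (1 - p_j))).sum() + np.log(1 - p_j).sum() + np.log(prior_j)

# Two classes over d = 3 binary features (assumed example numbers).
p = np.array([[0.8, 0.3, 0.5],       # p_i1 for class w1
              [0.2, 0.7, 0.5]])      # p_i2 for class w2
priors = [0.5, 0.5]
x = [1, 0, 1]
scores = [g(x, p[j], priors[j]) for j in range(2)]
print(np.argmax(scores))             # decide the class with the largest g_j(x)
\end{verbatim}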
Question 5: (23 pts) Suppose we have three categories with prior probabilities $P(\omega_1) = 0.5$, $P(\omega_2) = P(\omega_3) = 0.25$ and the class-conditional probability distributions
\[
p(x|\omega_1) \sim N(0, 1), \qquad
p(x|\omega_2) \sim N(0.5, 1), \qquad
p(x|\omega_3) \sim N(1, 1),
\]
where $N(\mu, \sigma^2)$ represents the normal distribution with density function
\[
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} .
\]
We sample the following sequence of four points: $x = 0.6, 0.1, 0.9, 1.1$.

1. (9 pts) Calculate explicitly the probability that the sequence actually came from $\omega_1, \omega_3, \omega_3, \omega_2$.

2. (6 pts) Repeat for the sequence $\omega_1, \omega_2, \omega_2, \omega_3$.

3. (8 pts) Find the sequence of states having the maximum probability.
Answer: It is straightforward to compute that
\[
\begin{array}{lll}
p(0.6|\omega_1) = 0.333225 & p(0.6|\omega_2) = 0.396953 & p(0.6|\omega_3) = 0.368270 \\
p(0.1|\omega_1) = 0.396953 & p(0.1|\omega_2) = 0.368270 & p(0.1|\omega_3) = 0.266085 \\
p(0.9|\omega_1) = 0.266085 & p(0.9|\omega_2) = 0.368270 & p(0.9|\omega_3) = 0.396953 \\
p(1.1|\omega_1) = 0.217852 & p(1.1|\omega_2) = 0.333225 & p(1.1|\omega_3) = 0.396953 .
\end{array}
\]
We denote $X = (x_1, x_2, x_3, x_4)$ and $\omega = (\omega(1), \omega(2), \omega(3), \omega(4))$. Clearly, there are $3^4 = 81$ possible values of $\omega$, such as
\[
(\omega_1, \omega_1, \omega_1, \omega_1), \;
(\omega_1, \omega_1, \omega_1, \omega_2), \;
(\omega_1, \omega_1, \omega_1, \omega_3), \;
\ldots, \;
(\omega_3, \omega_3, \omega_3, \omega_3) .
\]

For each possible value of $\omega$, we calculate $P(\omega)$ and $p(X|\omega)$ using the following, which assumes the independence of the $x_i$ and of the $\omega(i)$:
\[
p(X|\omega) = \prod_{i=1}^{4} p(x_i|\omega(i)) , \qquad
P(\omega) = \prod_{i=1}^{4} P(\omega(i)) .
\]

For example, if $\omega = (\omega_1, \omega_3, \omega_3, \omega_2)$ and $X = (0.6, 0.1, 0.9, 1.1)$, then we have
\begin{align*}
p(X|\omega) &= p((0.6, 0.1, 0.9, 1.1)|(\omega_1, \omega_3, \omega_3, \omega_2)) \\
&= p(0.6|\omega_1)\,p(0.1|\omega_3)\,p(0.9|\omega_3)\,p(1.1|\omega_2) \\
&= 0.333225 \times 0.266085 \times 0.396953 \times 0.333225 \\
&= 0.01173
\end{align*}
and
\[
P(\omega) = P(\omega_1)P(\omega_3)P(\omega_3)P(\omega_2)
= \frac{1}{2} \cdot \frac{1}{4} \cdot \frac{1}{4} \cdot \frac{1}{4}
= 0.0078125 .
\]

1. Given $X = (0.6, 0.1, 0.9, 1.1)$, the evidence is
\begin{align*}
p(X) &= p(x_1 = 0.6, x_2 = 0.1, x_3 = 0.9, x_4 = 1.1) \\
&= \sum_{\omega} p(x_1 = 0.6, x_2 = 0.1, x_3 = 0.9, x_4 = 1.1|\omega)\,P(\omega) \\
&= p(0.6|\omega_1)p(0.1|\omega_1)p(0.9|\omega_1)p(1.1|\omega_1)\,P(\omega_1)P(\omega_1)P(\omega_1)P(\omega_1) \\
&\quad + p(0.6|\omega_1)p(0.1|\omega_1)p(0.9|\omega_1)p(1.1|\omega_2)\,P(\omega_1)P(\omega_1)P(\omega_1)P(\omega_2) \\
&\quad \vdots \\
&\quad + p(0.6|\omega_3)p(0.1|\omega_3)p(0.9|\omega_3)p(1.1|\omega_3)\,P(\omega_3)P(\omega_3)P(\omega_3)P(\omega_3) \\
&= 0.012083 .
\end{align*}
Therefore, for $\omega = (\omega_1, \omega_3, \omega_3, \omega_2)$,
\begin{align*}
P(\omega|X) &= P(\omega_1, \omega_3, \omega_3, \omega_2|0.6, 0.1, 0.9, 1.1) \\
&= \frac{p(0.6, 0.1, 0.9, 1.1|\omega_1, \omega_3, \omega_3, \omega_2)\,P(\omega_1, \omega_3, \omega_3, \omega_2)}{p(X)} \\
&= \frac{0.01173 \times 0.0078125}{0.012083} \\
&= 0.007584 .
\end{align*}

2. Following the steps in part 1, we have
\begin{align*}
P(\omega_1, \omega_2, \omega_2, \omega_3|0.6, 0.1, 0.9, 1.1)
&= \frac{p(0.6, 0.1, 0.9, 1.1|\omega_1, \omega_2, \omega_2, \omega_3)\,P(\omega_1, \omega_2, \omega_2, \omega_3)}{p(X)} \\
&= \frac{0.01794 \times 0.0078125}{0.012083} \\
&= 0.01160 .
\end{align*}

3. The sequence $\omega = (\omega_1, \omega_1, \omega_1, \omega_1)$ has the maximum probability of having generated $X = (0.6, 0.1, 0.9, 1.1)$; this maximum posterior probability is $P(\omega|X) = 0.03966$.
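All three parts can be verified by brute force, since there are only $3^4 = 81$ state sequences. The sketch below is illustrative and not part of the original solution; states are indexed 0, 1, 2 for $\omega_1, \omega_2, \omega_3$.

\begin{verbatim}
# Brute-force check of Question 5 (illustrative sketch only).
from itertools import product
from math import exp, pi, sqrt

X = [0.6, 0.1, 0.9, 1.1]
means, priors = [0.0, 0.5, 1.0], [0.5, 0.25, 0.25]
normal = lambda x, m: exp(-(x - m) ** 2 / 2) / sqrt(2 * pi)   # unit-variance normal

def joint(omega):
    # p(X|omega) P(omega) for one sequence of states omega.
    p = 1.0
    for x, w in zip(X, omega):
        p *= normal(x, means[w]) * priors[w]
    return p

seqs = list(product(range(3), repeat=4))   # all 3^4 = 81 sequences
pX = sum(joint(w) for w in seqs)           # evidence p(X), ~0.012083
print(joint((0, 2, 2, 1)) / pX)            # part 1: ~0.007584
print(joint((0, 1, 1, 2)) / pX)            # part 2: ~0.01160
best = max(seqs, key=joint)
print(best, joint(best) / pX)              # part 3: (0, 0, 0, 0) with ~0.03966
\end{verbatim}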
