
Machine Learning for Econometrics

Week 1 Tutorials Questions

Lecturer: Yi He

2 November, 2023

Question 1: Support Vector Machines

(a) Prove that the following optimization problems on page 8 of the lecture slides generate the same
hard-margin SVM classifier:

Optimization Problem 1
\[
\begin{aligned}
\underset{w,\,w_0,\,M}{\text{maximize}} \quad & M \\
\text{subject to} \quad & Y_i\,(w^{T} X_i + w_0) \ge M \quad \forall i, \\
& \lVert w \rVert = 1.
\end{aligned}
\]

Optimization Problem 2
\[
\begin{aligned}
\underset{\beta,\,\beta_0}{\text{minimize}} \quad & \tfrac{1}{2}\,\beta^{T}\beta \\
\text{subject to} \quad & Y_i\,(\beta^{T} X_i + \beta_0) \ge 1 \quad \forall i.
\end{aligned}
\]

Assume that optimization problem 1 has a solution and that the length of the margin is positive.

(b) Henceforth we only consider the second optimization problem. Differentiate the Lagrange function
\[
\frac{1}{2}\,\beta^{T}\beta + \sum_{i=1}^{n} \lambda_i \left(1 - Y_i(\beta^{T} X_i + \beta_0)\right), \qquad \lambda_i \ge 0,
\]
with respect to $\beta$ to get the first-order condition
\[
\beta = \sum_{i=1}^{n} \lambda_i Y_i X_i.
\]

(c) Let $S = \{i : \hat\lambda_i \ne 0\}$ be the so-called support set, where $\hat\lambda_i$ is the optimal solution of $\lambda_i$ in part (b). By the slackness condition (you do not need to prove this) we also know that
\[
S = \{i : Y_i(\hat\beta^{T} X_i + \hat\beta_0) = 1\}.
\]
Verify that $\hat\beta = \sum_{i \in S} \hat\alpha_i X_i$ for $\hat\alpha_i = \hat\lambda_i Y_i$. The observations $\{X_i : i \in S\}$ are therefore called support vectors.
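
As an illustrative numerical sketch (not a proof), the identity in part (c) can be checked with scikit-learn: for a linear kernel, SVC exposes the products $\hat\lambda_i Y_i$ over the support set through dual_coef_ and the corresponding $X_i$ through support_vectors_. The separable toy data and the very large penalty C used to approximate the hard margin are assumptions made only for this illustration.

import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian clouds so that a hard-margin separator exists (assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3.0, size=(20, 2)),    # class +1
               rng.normal(loc=-3.0, size=(20, 2))])  # class -1
Y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e10).fit(X, Y)  # very large C approximates the hard margin

# dual_coef_ holds lambda_hat_i * Y_i for i in the support set S; support_vectors_ holds X_i.
beta_from_duals = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(clf.coef_, beta_from_duals))  # expected: True, matching part (c)
print("support set S:", clf.support_)           # indices with nonzero multipliers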

Question 2: Hinge Loss

Consider a two-class classification problem with a target $Y \in \{-1, 1\}$ and features $X \in \mathbb{R}^d$. Define
$p(x) = P(Y = 1 \mid X = x)$ for all $x \in \mathbb{R}^d$.

(a) Calculate the conditional hinge risk
\[
E[\ell_H(f(X), Y) \mid X = x]
\]
in terms of $p(x)$ and any candidate prediction rule $f : \mathbb{R}^d \to \mathbb{R}$.

(b) Suppose that $p(x) \notin \{0, 1/2, 1\}$. Show that the Bayes classifier
\[
C^{\mathrm{Bayes}}(x) =
\begin{cases}
1 & p(x) > 1/2 \\
-1 & p(x) < 1/2
\end{cases}
\]
minimizes the conditional expected hinge loss in part (a) over all candidate prediction rules $f$.

(c) Show that $C^{\mathrm{Bayes}}(x)$ also minimizes the population classification error
\[
P(Y \ne \operatorname{sign}(f(X)) \mid X = x) = P(Y f(X) \le 0 \mid X = x)
\]
over all candidate prediction rules $f : \mathbb{R}^d \to \mathbb{R}$.
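
One way to build intuition for parts (a) and (b) is the following rough numerical sketch: it tabulates $p(x)\max(0, 1-f) + (1-p(x))\max(0, 1+f)$, a way of writing the conditional hinge risk, over a grid of candidate predictions $f$ and checks that the sign of the minimiser agrees with the Bayes classifier. The value $p(x) = 0.7$ and the grid are arbitrary assumptions.

import numpy as np

def conditional_hinge_risk(f, p):
    # E[max(0, 1 - Y f) | X = x] when P(Y = 1 | X = x) = p
    return p * np.maximum(0.0, 1.0 - f) + (1.0 - p) * np.maximum(0.0, 1.0 + f)

p_x = 0.7                                # assumed value of p(x), here above 1/2
f_grid = np.linspace(-3.0, 3.0, 601)     # candidate predictions f(x)
risks = conditional_hinge_risk(f_grid, p_x)
f_star = f_grid[np.argmin(risks)]

bayes = 1 if p_x > 0.5 else -1
print("risk-minimising f:", f_star, "sign:", int(np.sign(f_star)), "Bayes:", bayes)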

Question 3: Softmax Transformation

Consider again the two-class classification problem from Question 2. Let $\sigma$ denote the sigmoid function given by
\[
\sigma(a) = \frac{1}{1 + \exp(-a)}, \qquad a \in \mathbb{R}.
\]

(a) Prove that the bivariate regression function for the one-hot encoded target satisfies the following representation:
\[
P(Y = 1 \mid X = x) = \sigma(a(x)), \qquad P(Y = -1 \mid X = x) = \sigma(-a(x)),
\]
with
\[
a(x) = \log \frac{P(Y = 1 \mid X = x)}{P(Y = -1 \mid X = x)}.
\]

(b) Using part (a), verify that
\[
\begin{pmatrix} P(Y = 1 \mid X = x) \\ P(Y = -1 \mid X = x) \end{pmatrix}
= \sigma\!\left(\begin{pmatrix} a(x) \\ 0 \end{pmatrix}\right),
\]
where $\sigma$ denotes the softmax transformation.
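
A quick numerical check for part (b) (illustrative, not a proof): the softmax of the vector $(a, 0)$ should return $(\sigma(a), \sigma(-a))$. The sample values of $a$ and the use of scipy.special.softmax are assumptions; any softmax implementation would do.

import numpy as np
from scipy.special import expit, softmax  # expit is the sigmoid function

for a in (-2.0, 0.5, 3.0):                # a few arbitrary log-odds values
    sm = softmax([a, 0.0])                # softmax of the vector (a, 0)
    print(np.allclose(sm, [expit(a), expit(-a)]))  # expected: True each time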

(c) Prove that the Bayes classifier equals the sign of the log-odds function $a(x)$, that is,
\[
C^{\mathrm{Bayes}}(x) = \operatorname{sign}(a(x)) =
\begin{cases}
1 & a(x) > 0 \\
-1 & a(x) < 0.
\end{cases}
\]

Question 4: Cross Entropy


Let $\hat y, y^o \in P_K = \{y \in (0, 1)^K : \sum_{k=1}^{K} y_k = 1\}$ represent two distributions: $\hat y = (\hat y_1, \ldots, \hat y_K)$ is an estimate of the true distribution $y^o = (y_1^o, \ldots, y_K^o)$. Consider the cross-entropy loss given by
\[
\ell_{CE}(\hat y, y^o) = -\sum_{k=1}^{K} y_k^o \log \hat y_k.
\]
Note that $\ell_{CE}(\hat y, y^o) \ne \ell_{CE}(y^o, \hat y)$.

(a) Prove that log x ≤ x − 1 for all x > 0, where the equality holds if and only if x = 1.

(b) Use part (a) to prove that the cross-entropy loss is minimal at the true distribution $y^o$, that is,
\[
\ell_{CE}(\hat y, y^o) \ge \ell_{CE}(y^o, y^o).
\]

When does the equality hold?
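
As a rough numerical illustration of part (b) (the true distribution and the candidate estimates below are arbitrary assumptions), the excess cross-entropy $\ell_{CE}(\hat y, y^o) - \ell_{CE}(y^o, y^o)$ is nonnegative and vanishes at $\hat y = y^o$.

import numpy as np

def cross_entropy(y_hat, y_o):
    return -np.sum(y_o * np.log(y_hat))

y_o = np.array([0.5, 0.3, 0.2])                       # assumed true distribution
for y_hat in (np.array([0.6, 0.2, 0.2]),
              np.array([0.2, 0.4, 0.4]),
              y_o):
    gap = cross_entropy(y_hat, y_o) - cross_entropy(y_o, y_o)
    print(y_hat, "excess cross-entropy:", round(gap, 4))  # nonnegative, zero at y_o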

(c) Use the concavity of the logarithm function to verify that the cross-entropy loss is convex in $\hat y$, meaning that for any two estimates $\hat y, \tilde y \in P_K$ and $\lambda \in (0, 1)$,
\[
\ell_{CE}\!\left(\lambda \hat y + (1 - \lambda)\tilde y,\; y^o\right) \le \lambda\, \ell_{CE}(\hat y, y^o) + (1 - \lambda)\, \ell_{CE}(\tilde y, y^o).
\]
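
Similarly, the inequality in part (c) can be spot-checked numerically; the two candidate distributions and the grid of mixing weights below are assumptions made for illustration only.

import numpy as np

def cross_entropy(y_hat, y_o):
    return -np.sum(y_o * np.log(y_hat))

y_o = np.array([0.5, 0.3, 0.2])
y_hat = np.array([0.7, 0.2, 0.1])
y_tilde = np.array([0.1, 0.6, 0.3])

for lam in np.linspace(0.05, 0.95, 19):
    mix = lam * y_hat + (1.0 - lam) * y_tilde
    lhs = cross_entropy(mix, y_o)
    rhs = lam * cross_entropy(y_hat, y_o) + (1.0 - lam) * cross_entropy(y_tilde, y_o)
    assert lhs <= rhs + 1e-12              # the convexity inequality from part (c)
print("convexity inequality holds on the grid")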

(d) Let $Y \in \{1, \ldots, K\}$ be a target variable from the true distribution $y^o$ such that $P(Y = k) = y_k^o$. Show that the log-likelihood for the parameter vector $\hat y$ is given by
\[
\log L(\hat y \mid Y) = \sum_{k=1}^{K} \mathbf{1}[Y = k] \log \hat y_k.
\]

(e) Compare the expression of the log-likelihood function in (d) with the cross-entropy error. Can
you find a relation between the method of maximum likelihood and minimum cross-entropy?
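
A hedged sketch of the relation asked for in part (e) (the sample size, true distribution, and candidate $\hat y$ below are assumptions): for an i.i.d. sample $Y_1, \ldots, Y_n$ from $y^o$, the average negative log-likelihood of a candidate $\hat y$ equals the cross-entropy between the empirical class frequencies and $\hat y$, so maximising the likelihood amounts to minimising that empirical cross-entropy.

import numpy as np

rng = np.random.default_rng(1)
y_o = np.array([0.5, 0.3, 0.2])              # assumed true distribution
y_hat = np.array([0.4, 0.4, 0.2])            # an arbitrary candidate estimate
n = 10_000

sample = rng.choice(len(y_o), size=n, p=y_o)           # classes coded 0, ..., K-1
freq = np.bincount(sample, minlength=len(y_o)) / n     # empirical distribution of the sample

avg_neg_loglik = -np.mean(np.log(y_hat[sample]))       # -(1/n) * sum_i log y_hat[Y_i]
empirical_ce = -np.sum(freq * np.log(y_hat))           # cross-entropy of y_hat against freq
print(np.isclose(avg_neg_loglik, empirical_ce))        # expected: True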
