Large Margin Classification
SVM’s Cost Function
Support Vector Machines replace the smooth sigmoid-based cost of logistic regression with a piecewise-linear cost (the magenta line in the lecture plot). This gives the SVM computational advantages and yields an easier optimization problem.
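The exact slope of the piecewise-linear pieces is not important; one common way to write them (an assumed concrete form, not spelled out in these notes) is
$$\text{cost}_1(z) = \max(0,\ 1 - z), \qquad \text{cost}_0(z) = \max(0,\ 1 + z), \qquad z = \theta^T x,$$
so $\text{cost}_1$ (used when $y = 1$) is zero once $z \ge 1$, and $\text{cost}_0$ (used when $y = 0$) is zero once $z \le -1$.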
Optimization Objective
- Logistic regression (the regularized cross-entropy objective)
- Support vector machine (the hinge-style objective, written out below)
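For reference, the two optimization objectives in the course's notation:

Logistic regression:
$$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\big(-\log h_\theta(x^{(i)})\big) + (1-y^{(i)})\big(-\log(1-h_\theta(x^{(i)}))\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Support vector machine:
$$\min_\theta\ C\sum_{i=1}^{m}\Big[\, y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\Big] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$

The SVM form drops the constant $\frac{1}{m}$ (which does not change the minimizer) and puts $C$ on the data-fit term instead of $\lambda$ on the regularization term, so $C$ behaves roughly like $\frac{1}{\lambda}$.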
Hypothesis
Unlike logistic regression, the SVM does not output a probability; it predicts the label directly: $h_\theta(x) = 1$ if $\theta^T x \ge 0$, and $h_\theta(x) = 0$ otherwise.
SVM Decision Boundary: Linearly separable case
For the linearly separable case, the SVM chooses the decision boundary (the black line in the lecture figure) with the largest distance to the nearest examples.
This distance is called the margin of the support vector machine, and it gives the SVM a certain robustness, because it tries to separate the data with as large a margin as possible.
The Parameter C
- If C is large, the classifier becomes more sensitive to individual (possibly outlying) examples, since it tries harder to classify every training example correctly. C plays a role similar to $\frac{1}{\lambda}$ in regularized logistic regression.
Mathematics Behind Large Margin Classification
Norm of a Vector
The norm of a vector $u$ is its length, $\|u\| = \sqrt{u^T u}$.
If the vector is 2-dimensional, $\|u\| = \sqrt{u_1^2 + u_2^2}$, the same calculation as in the Pythagorean theorem.
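The large margin derivation below also uses the relation between the inner product and the projection (stated for 2-dimensional vectors):
$$u^T v = u_1 v_1 + u_2 v_2 = p \cdot \|u\|,$$
where $p$ is the signed length of the projection of $v$ onto $u$ ($p < 0$ when the angle between the two vectors is greater than $90°$).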
Derivation
If C is very large, the data-fit term must be driven to (approximately) zero, so the optimization objective changes into the following form:
$$\min_\theta\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{s.t.} \quad \theta^T x^{(i)} \ge 1 \ \text{if } y^{(i)} = 1, \qquad \theta^T x^{(i)} \le -1 \ \text{if } y^{(i)} = 0.$$
There is another relationship inside: $\theta^T x^{(i)} = p^{(i)} \cdot \|\theta\|$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto the parameter vector $\theta$.
Simplification: the whole derivation uses the simplification that the parameter $\theta_0$ is equal to zero (so the decision boundary passes through the origin) and that there are only $n = 2$ features.
It turns out that the same large margin proof works in pretty much exactly the same way without these simplifications.
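A worked version of the argument under the $\theta_0 = 0$, $n = 2$ simplification:
$$\min_\theta\ \frac{1}{2}(\theta_1^2 + \theta_2^2) = \frac{1}{2}\|\theta\|^2 \quad \text{s.t.} \quad p^{(i)}\,\|\theta\| \ge 1 \ \text{if } y^{(i)} = 1, \qquad p^{(i)}\,\|\theta\| \le -1 \ \text{if } y^{(i)} = 0.$$
To satisfy these constraints while keeping $\|\theta\|$ (and hence the objective) small, the projections $|p^{(i)}|$ must be large. Since $\theta$ is perpendicular to the decision boundary, large $|p^{(i)}|$ means every example lies far from the boundary, which is exactly a large margin.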
Kernels
When dealing with a non-linear decision boundary, we need to choose different features.
One way to obtain such features is called the kernel method.
Given $x$, compute new features depending on the proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}, \dots$
Each landmark defines a new feature $f_i$.
Similarity Function
The similarity function measures an example's proximity to a landmark:
$$f_i = \text{sim}(x, l^{(i)}) = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$$
This function is the so-called Gaussian kernel.
- If $x \approx l^{(i)}$: $\|x - l^{(i)}\|^2 \approx 0$, so $f_i \approx 1$.
- If $x$ is far from $l^{(i)}$: $\|x - l^{(i)}\|^2$ is large, so $f_i \approx 0$.
More details: the parameter $\sigma$ controls how quickly the similarity falls off as $x$ moves away from the landmark.
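A small numeric check of the two limiting behaviors, with $\sigma = 1$ (values chosen here purely for illustration):
$$\|x - l^{(i)}\| = 0 \ \Rightarrow\ f_i = e^{0} = 1, \qquad \|x - l^{(i)}\| = 3 \ \Rightarrow\ f_i = e^{-9/2} \approx 0.011.$$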
Landmarks Choice
All the training examples' inputs are taken to be landmarks, so the number of landmarks (and of new features) equals the number of training examples $m$.
Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$,
choose $l^{(1)} = x^{(1)},\ l^{(2)} = x^{(2)},\ \dots,\ l^{(m)} = x^{(m)}$.
Given an example $x$: compute $f_1 = \text{sim}(x, l^{(1)}),\ f_2 = \text{sim}(x, l^{(2)}),\ \dots,\ f_m = \text{sim}(x, l^{(m)})$.
For training example $(x^{(i)}, y^{(i)})$: $f^{(i)} = \big[\, f^{(i)}_0 = 1,\ f^{(i)}_1 = \text{sim}(x^{(i)}, l^{(1)}),\ \dots,\ f^{(i)}_m = \text{sim}(x^{(i)}, l^{(m)}) \,\big]$, as sketched in the code below.
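A minimal Octave sketch of this feature mapping, assuming the training inputs are stored row-wise in a matrix X (the function name and variable names are illustrative, not taken from any particular package):

```matlab
% Map an input x (column vector) to the Gaussian-kernel feature vector f,
% using every training example (each row of X) as a landmark.
function f = kernelFeatures(x, X, sigma)
  m = size(X, 1);            % number of training examples = number of landmarks
  f = zeros(m + 1, 1);
  f(1) = 1;                  % f_0 = 1 (intercept feature)
  for i = 1:m
    l = X(i, :)';            % i-th landmark
    f(i + 1) = exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2));
  end
end
```

With learned parameters $\theta$, the prediction for $x$ is then $y = 1$ if $\theta^T f \ge 0$.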
Bias-Variance Trade Off
SVM with kernels
Hypothesis: Given $x$, compute features $f \in \mathbb{R}^{m+1}$.
Predict "y = 1" if $\theta^T f \ge 0$.
Training: minimize the SVM objective with the kernel features $f^{(i)}$ in place of $x^{(i)}$, as written out below.
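The training objective with kernel features:
$$\min_\theta\ C\sum_{i=1}^{m}\Big[\, y^{(i)}\,\text{cost}_1(\theta^T f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T f^{(i)})\Big] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$$
Note that the regularization sum now runs to $m$, because there is one parameter per landmark.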
SVM parameters:
- $C$ ($= \frac{1}{\lambda}$):
  - Large C: lower bias, higher variance (corresponds to small $\lambda$).
  - Small C: higher bias, lower variance (corresponds to large $\lambda$).
- $\sigma^2$:
  - Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance.
  - Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance.
SVMs in Practice
In practice we seldom implement the SVM optimization ourselves; instead we use off-the-shelf libraries (e.g. liblinear, libsvm, …).
First step
Choose a well-tested software library for your programming environment.
Then choose the parameter C.
Need to specify:
- Choice of parameter C
- Choice of kernel (similarity function):
E.g. no kernel ("linear kernel"): predict "y = 1" if $\theta^T x \ge 0$.
Second Step
Choose the kernel (similarity function) that you want to use.
Gaussian kernel: $f_i = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$.
Need to choose $\sigma^2$.
Some packages need you to implement the kernel (similarity) function yourself:
```matlab
function f = kernel(x1, x2)
  sigma = 1;  f = exp(-sum((x1 - x2) .^ 2) / (2 * sigma ^ 2));  % Gaussian similarity; sigma hard-coded here for illustration
```
Note
If features take on very different ranges of values, do perform feature scaling before using the Gaussian kernel; otherwise the distance $\|x - l\|^2$ is dominated by the features with the largest scale.
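A minimal training sketch, assuming the libsvm MATLAB/Octave interface (svmtrain/svmpredict) is installed; X, y, Xtest, ytest are placeholder variable names:

```matlab
% Train an SVM with a Gaussian (RBF) kernel using libsvm.
% '-t 2' selects the RBF kernel, '-c' sets C, and '-g' sets gamma = 1/(2*sigma^2).
C = 1;  sigma = 0.5;
model = svmtrain(y, X, sprintf('-t 2 -c %f -g %f', C, 1 / (2 * sigma ^ 2)));
[pred, acc, dec] = svmpredict(ytest, Xtest, model);
```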
Other Kernels
The linear kernel and the Gaussian kernel are the two most common similarity functions.
NOT all similarity functions make valid kernels.
(They need to satisfy a technical condition called "Mercer's Theorem" to make sure the SVM packages' optimizations run correctly and do not diverge.)
Mercer's Theorem: any positive semi-definite function can be used as a kernel function.
Many off-the-shelf kernels available:
- Polynomial kernels
- More esoteric: String kernel, chi-square kernel, histogram intersection kernel….
Multi-class
- Many SVM packages already have built-in multi-class classification functionality.
- Otherwise, use the one-vs.-all method: train $K$ SVMs, one to distinguish each class from the rest, and pick the class $i$ with the largest $(\theta^{(i)})^T x$.
Logistic Regression vs. SVMs
n = no. of features ($x \in \mathbb{R}^{n+1}$)
m = no. of training examples
- If n is large (relative to m), e.g. $n = 10{,}000$, $m = 10 \dots 1000$:
  Use logistic regression, or SVM without a kernel ("linear kernel").
- If n is small and m is intermediate, e.g. $n = 1 \dots 1000$, $m = 10 \dots 10{,}000$:
  Use SVM with a Gaussian kernel.
- If n is small and m is large, e.g. $n = 1 \dots 1000$, $m = 50{,}000+$:
  Create/add more features, then use logistic regression or SVM without a kernel.
Neural network likely to work well for most of these settings, but may be slower to train.