Large Margin Classification
SVM’s Cost Function
Support Vector Machines replace the smooth sigmoid-based cost of logistic regression with a piecewise-linear cost (the magenta line in the lecture plot). This gives the SVM computational advantages and yields an easier optimization problem.
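The exact slope of the piecewise-linear pieces is not important; one common way to write them (an assumed concrete form, not spelled out in these notes) is
$$\text{cost}_1(z) = \max(0,\ 1 - z), \qquad \text{cost}_0(z) = \max(0,\ 1 + z), \qquad z = \theta^T x,$$
so $\text{cost}_1$ (used when $y = 1$) is zero once $z \ge 1$, and $\text{cost}_0$ (used when $y = 0$) is zero once $z \le -1$.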
Optimization Objective
- Logistic regression (the regularized cross-entropy objective)
- Support vector machine (the hinge-style objective, written out below)
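For reference, the two optimization objectives in the course's notation:

Logistic regression:
$$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\big(-\log h_\theta(x^{(i)})\big) + (1-y^{(i)})\big(-\log(1-h_\theta(x^{(i)}))\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Support vector machine:
$$\min_\theta\ C\sum_{i=1}^{m}\Big[\, y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\Big] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$

The SVM form drops the constant $\frac{1}{m}$ (which does not change the minimizer) and puts $C$ on the data-fit term instead of $\lambda$ on the regularization term, so $C$ behaves roughly like $\frac{1}{\lambda}$.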
Hypothesis
Unlike logistic regression, the SVM does not output a probability; it predicts the label directly: $h_\theta(x) = 1$ if $\theta^T x \ge 0$, and $h_\theta(x) = 0$ otherwise.
SVM Decision Boundary: Linearly separable case
For the linearly separable case, the SVM chooses the decision boundary (the black line in the lecture figure) with the largest distance to the nearest examples.
This distance is called the margin of the support vector machine, and it gives the SVM a certain robustness, because it tries to separate the data with as large a margin as possible.
The Parameter C
- If C is large, the classifier becomes more sensitive to individual (possibly outlying) examples, since it tries harder to classify every training example correctly. C plays a role similar to $\frac{1}{\lambda}$ in regularized logistic regression.
Mathematics Behind Large Margin Classification
Norm of a Vector
The norm of a vector $u$ is its length, $\|u\| = \sqrt{u^T u}$.
If the vector is 2-dimensional, $\|u\| = \sqrt{u_1^2 + u_2^2}$, the same calculation as in the Pythagorean theorem.
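The large margin derivation below also uses the relation between the inner product and the projection (stated for 2-dimensional vectors):
$$u^T v = u_1 v_1 + u_2 v_2 = p \cdot \|u\|,$$
where $p$ is the signed length of the projection of $v$ onto $u$ ($p < 0$ when the angle between the two vectors is greater than $90°$).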
Derivation
If C is very large, the data-fit term must be driven to (approximately) zero, so the optimization objective changes into the following form:
$$\min_\theta\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{s.t.} \quad \theta^T x^{(i)} \ge 1 \ \text{if } y^{(i)} = 1, \qquad \theta^T x^{(i)} \le -1 \ \text{if } y^{(i)} = 0.$$
There is another relationship inside: $\theta^T x^{(i)} = p^{(i)} \cdot \|\theta\|$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto the parameter vector $\theta$.
Simplification: the whole derivation uses the simplification that the parameter $\theta_0$ is equal to zero (so the decision boundary passes through the origin) and that there are only $n = 2$ features.
It turns out that the same large margin proof works in pretty much exactly the same way without these simplifications.
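A worked version of the argument under the $\theta_0 = 0$, $n = 2$ simplification:
$$\min_\theta\ \frac{1}{2}(\theta_1^2 + \theta_2^2) = \frac{1}{2}\|\theta\|^2 \quad \text{s.t.} \quad p^{(i)}\,\|\theta\| \ge 1 \ \text{if } y^{(i)} = 1, \qquad p^{(i)}\,\|\theta\| \le -1 \ \text{if } y^{(i)} = 0.$$
To satisfy these constraints while keeping $\|\theta\|$ (and hence the objective) small, the projections $|p^{(i)}|$ must be large. Since $\theta$ is perpendicular to the decision boundary, large $|p^{(i)}|$ means every example lies far from the boundary, which is exactly a large margin.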
Kernels
When dealing with a non-linear decision boundary, we need to choose different features.
One way to obtain such features is called the kernel method.
Given $x$, compute new features depending on the proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}, \dots$
Each landmark defines a new feature $f_i$.
Similarity Function
The similarity function measures an example's proximity to a landmark:
$$f_i = \text{sim}(x, l^{(i)}) = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$$
This function is the so-called Gaussian kernel.
- If $x \approx l^{(i)}$: $\|x - l^{(i)}\|^2 \approx 0$, so $f_i \approx 1$.
- If $x$ is far from $l^{(i)}$: $\|x - l^{(i)}\|^2$ is large, so $f_i \approx 0$.
More details: the parameter $\sigma$ controls how quickly the similarity falls off as $x$ moves away from the landmark.
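A small numeric check of the two limiting behaviors, with $\sigma = 1$ (values chosen here purely for illustration):
$$\|x - l^{(i)}\| = 0 \ \Rightarrow\ f_i = e^{0} = 1, \qquad \|x - l^{(i)}\| = 3 \ \Rightarrow\ f_i = e^{-9/2} \approx 0.011.$$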
Landmarks Choice
All the training examples' inputs are taken to be landmarks, so the number of landmarks (and of new features) equals the number of training examples $m$.
Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$,
choose $l^{(1)} = x^{(1)},\ l^{(2)} = x^{(2)},\ \dots,\ l^{(m)} = x^{(m)}$.
Given an example $x$: compute $f_1 = \text{sim}(x, l^{(1)}),\ f_2 = \text{sim}(x, l^{(2)}),\ \dots,\ f_m = \text{sim}(x, l^{(m)})$.
For training example $(x^{(i)}, y^{(i)})$: $f^{(i)} = \big[\, f^{(i)}_0 = 1,\ f^{(i)}_1 = \text{sim}(x^{(i)}, l^{(1)}),\ \dots,\ f^{(i)}_m = \text{sim}(x^{(i)}, l^{(m)}) \,\big]$, as sketched in the code below.
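A minimal Octave sketch of this feature mapping, assuming the training inputs are stored row-wise in a matrix X (the function name and variable names are illustrative, not taken from any particular package):

```matlab
% Map an input x (column vector) to the Gaussian-kernel feature vector f,
% using every training example (each row of X) as a landmark.
function f = kernelFeatures(x, X, sigma)
  m = size(X, 1);            % number of training examples = number of landmarks
  f = zeros(m + 1, 1);
  f(1) = 1;                  % f_0 = 1 (intercept feature)
  for i = 1:m
    l = X(i, :)';            % i-th landmark
    f(i + 1) = exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2));
  end
end
```

With learned parameters $\theta$, the prediction for $x$ is then $y = 1$ if $\theta^T f \ge 0$.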
Bias-Variance Trade Off
SVM with kernels
Hypothesis: Given $x$, compute features $f \in \mathbb{R}^{m+1}$.
Predict "y = 1" if $\theta^T f \ge 0$.
Training: minimize the SVM objective with the kernel features $f^{(i)}$ in place of $x^{(i)}$, as written out below.
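The training objective with kernel features:
$$\min_\theta\ C\sum_{i=1}^{m}\Big[\, y^{(i)}\,\text{cost}_1(\theta^T f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T f^{(i)})\Big] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$$
Note that the regularization sum now runs to $m$, because there is one parameter per landmark.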
SVM parameters:
- $C$ ($= \frac{1}{\lambda}$):
  - Large C: lower bias, higher variance (corresponds to small $\lambda$).
  - Small C: higher bias, lower variance (corresponds to large $\lambda$).
- $\sigma^2$:
  - Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance.
  - Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance.
SVMs in Practice
In practice we seldom implement the SVM optimization ourselves; instead we use off-the-shelf libraries (e.g. liblinear, libsvm, …).
First step
Choose a well-tested software library for your programming environment.
Then choose the parameter C.
Need to specify:
- Choice of parameter C
- Choice of kernel (similarity function):
E.g. no kernel ("linear kernel"): predict "y = 1" if $\theta^T x \ge 0$.
Second Step
Choose the kernel (similarity function) that you want to use.
Gaussian kernel: $f_i = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$.
Need to choose $\sigma^2$.
Some packages need you to implement the kernel (similarity) function yourself:
```matlab
function f = kernel(x1, x2)
  sigma = 1;  f = exp(-sum((x1 - x2) .^ 2) / (2 * sigma ^ 2));  % Gaussian similarity; sigma hard-coded here for illustration
```
Note
If features take on very different ranges of values, do perform feature scaling before using the Gaussian kernel; otherwise the distance $\|x - l\|^2$ is dominated by the features with the largest scale.
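A minimal training sketch, assuming the libsvm MATLAB/Octave interface (svmtrain/svmpredict) is installed; X, y, Xtest, ytest are placeholder variable names:

```matlab
% Train an SVM with a Gaussian (RBF) kernel using libsvm.
% '-t 2' selects the RBF kernel, '-c' sets C, and '-g' sets gamma = 1/(2*sigma^2).
C = 1;  sigma = 0.5;
model = svmtrain(y, X, sprintf('-t 2 -c %f -g %f', C, 1 / (2 * sigma ^ 2)));
[pred, acc, dec] = svmpredict(ytest, Xtest, model);
```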
Other Kernels
The linear kernel and the Gaussian kernel are the two most common similarity functions.
NOT all similarity functions make valid kernels.
(They need to satisfy a technical condition called "Mercer's Theorem" to make sure the SVM packages' optimizations run correctly and do not diverge.)
Mercer's Theorem: any positive semi-definite function can be used as a kernel function.
Many off-the-shelf kernels available:
- Polynomial kernels
- More esoteric: String kernel, chi-square kernel, histogram intersection kernel….
Multi-class
- Many SVM packages already have built-in multi-class classification functionality.
- Otherwise, use the one-vs.-all method: train $K$ SVMs, one to distinguish each class from the rest, and pick the class $i$ with the largest $(\theta^{(i)})^T x$.
Logistic Regression vs. SVMs
n = no. of features ($x \in \mathbb{R}^{n+1}$)
m = no. of training examples
- If n is large (relative to m), e.g. $n = 10{,}000$, $m = 10 \dots 1000$:
  Use logistic regression, or SVM without a kernel ("linear kernel").
- If n is small and m is intermediate, e.g. $n = 1 \dots 1000$, $m = 10 \dots 10{,}000$:
  Use SVM with a Gaussian kernel.
- If n is small and m is large, e.g. $n = 1 \dots 1000$, $m = 50{,}000+$:
  Create/add more features, then use logistic regression or SVM without a kernel.
Neural network likely to work well for most of these settings, but may be slower to train.