Support Vector Machine
Basic concept of SVM
Support vector machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression.
Support Vector Classifier
Support vector classifier (SVC) is a binary classifier that finds the optimal hyperplane that separates the two classes.
Although SVC is inherently a binary classifier, it can be extended to multi-class classification (Lookup: one-vs-one, one-vs-rest, multiclass SVM).
Training data
- For $i = 1, \dots, n$, $\mathbf{x}_i$ is the $i$-th observation. Do not confuse this notation with the $x_1$ and $x_2$ in the figure below: the $x_1$ and $x_2$ in the figure denote the two hypothetical features used just to illustrate the concept of SVC in a 2D space.
- In our case, each $\mathbf{x}_i$ is a $p$-dimensional vector that represents the features of the $i$-th observation (so each colored dot in the figure below).
- $y_i \in \{-1, +1\}$ is the (binary) class label of $\mathbf{x}_i$.
Hyperplane
A hyperplane is an affine subspace of dimension $p - 1$ in $\mathbb{R}^p$. A hyperplane in $\mathbb{R}^2$ is a line (1-dimensional subspace) and in $\mathbb{R}^3$ is a plane (2-dimensional subspace).
- We want to find the optimal hyperplane that separates the two classes.
- The optimal hyperplane is the one that maximizes the margin (Maximal Margin Classifier), which is the distance between the hyperplane and the closest observations.
We want to estimate a hyperplane defined by

$$\mathbf{w}^\top \mathbf{x} + b = 0$$

- $\mathbf{w}$ is the vector normal to the hyperplane.
- $b$ is the bias of the hyperplane.
  - If $b = 0$, the hyperplane passes through the origin.
  - If $b \neq 0$, the hyperplane is shifted away from the origin.
Equation looks different
As in the figure above, sometimes people define the hyperplane with a negative bias term:

$$\mathbf{w}^\top \mathbf{x} - b = 0$$
This is just a matter of notation and does not affect the estimation.
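As a quick illustration, here is a minimal sketch (with made-up values for $\mathbf{w}$ and $b$, not estimated from data) of classifying a point by the sign of $\mathbf{w}^\top \mathbf{x} + b$:

```python
import numpy as np

# Hypothetical hyperplane parameters in a 2D input space.
w = np.array([2.0, -1.0])   # normal vector to the hyperplane
b = -0.5                    # bias term

def classify(x):
    """Return the predicted class label (+1 or -1) for a single observation x."""
    return 1 if w @ x + b >= 0 else -1

print(classify(np.array([1.0, 0.0])))   # w.x + b =  1.5 -> +1
print(classify(np.array([0.0, 1.0])))   # w.x + b = -1.5 -> -1
```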
Support vectors
Support vectors are the observations closest to the hyperplane, i.e., the points that lie exactly on the margin and satisfy $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$.
More details
More generally, the support vectors are the observations that satisfy:

$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) = M$$

for some margin width $M > 0$. However, with normalization (rescaling $\mathbf{w}$ and $b$), we can simplify this to $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$.
In the figure above the support vectors are the points on the dotted margin lines (two of which are blue and one green).
Support vectors are actual observations and not some hypothetical values that satisfy the conditions.
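For example, with scikit-learn on a tiny made-up linearly separable dataset, the fitted classifier exposes exactly which training observations ended up as support vectors. This is a sketch, not part of the original derivation:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy data (hypothetical values).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C approximates a hard margin (see the next section).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The fitted model stores the actual observations acting as support vectors.
print(clf.support_vectors_)   # rows of X lying on the margin
print(clf.support_)           # their indices in the training data
```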
Optimization problem
Hard margin SVC
If the data are linearly separable, we can find a hyperplane that perfectly separates the two classes with a hard margin.
In order for classification to be perfect, we want the observations with label $y_i = +1$ to satisfy $\mathbf{w}^\top \mathbf{x}_i + b \geq 1$, and the observations with label $y_i = -1$ to satisfy $\mathbf{w}^\top \mathbf{x}_i + b \leq -1$.
This hard-margin constraint can be compactly written as:

$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \forall i$$
Also, the larger the distance from the hyperplane, the more confident we are in the classification (hence the name).
Now we need to find a hyperplane maximizing the margin while adhering to the constraint above, where the size of the margin is defined as:

$$\text{margin} = \frac{2}{\|\mathbf{w}\|}$$
What?
Let $\mathbf{x}_0$ be a point on the hyperplane, so $\mathbf{w}^\top \mathbf{x}_0 + b = 0$.
Let $\mathbf{x}_+$ be a support vector on the positive margin line, so $\mathbf{w}^\top \mathbf{x}_+ + b = 1$.
We know that $\mathbf{w}$ is normal to the hyperplane, so the shortest path from the hyperplane to the margin line runs along the unit normal $\mathbf{w} / \|\mathbf{w}\|$.
If we start at $\mathbf{x}_0$ and move a distance $d$ along the unit normal, we reach the positive margin line: $\mathbf{w}^\top \left(\mathbf{x}_0 + d \, \frac{\mathbf{w}}{\|\mathbf{w}\|}\right) + b = 1$.
We want to solve for $d$. Expanding and using $\mathbf{w}^\top \mathbf{x}_0 + b = 0$ gives $d \, \|\mathbf{w}\| = 1$, so $d = \frac{1}{\|\mathbf{w}\|}$.
Since the margin is twice the distance $d$ (from the negative margin line to the positive margin line), the margin is $\frac{2}{\|\mathbf{w}\|}$.
So our maximization problem is:

$$\max_{\mathbf{w}, b} \; \frac{2}{\|\mathbf{w}\|} \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \;\; \forall i$$

We instead solve a minimization problem of:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \;\; \forall i$$
We use $\frac{1}{2}\|\mathbf{w}\|^2$ instead of $\frac{2}{\|\mathbf{w}\|}$ because it is differentiable everywhere and easier to optimize, without changing the location of the optimum.
This primal problem can be solved using Lagrange multipliers.
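As a rough sanity check, here is a sketch using scikit-learn's `SVC` with a very large `C` to approximate the hard margin on made-up separable data; it verifies the constraint $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ and computes the $2 / \|\mathbf{w}\|$ margin width:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data (hypothetical values).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e9).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]

# Every observation should satisfy y_i (w.x_i + b) >= 1 (up to numerical tolerance).
print(y * (X @ w + b))

# The margin width is 2 / ||w||.
print(2 / np.linalg.norm(w))
```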
Soft margin SVC
In reality, the data are not always linearly separable with a hard margin.
We can relax the constraint to allow some mistakes, but instead penalize the misclassification.
To account for the misclassifications, we introduce a slack variable $\xi_i \geq 0$ for each observation, defined with the hinge loss:

$$\xi_i = \max\left(0, \; 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right)$$

The right side of the $\max$ is positive only when the observation violates the margin, i.e., when $y_i(\mathbf{w}^\top \mathbf{x}_i + b) < 1$.
If the observation is misclassified or falls inside the margin, $\xi_i > 0$; if it is on the correct side of the margin, $\xi_i = 0$.
Geometrically, $\xi_i$ measures how far the observation sits on the wrong side of its margin line.
We obviously want this slack variable to be minimized.
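A minimal sketch of computing these slacks with the hinge loss, assuming a hypothetical $\mathbf{w}$, $b$, and a few made-up observations:

```python
import numpy as np

# Hypothetical hyperplane and a few observations, some violating the margin.
w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[3.0, 3.0],    # well on the correct side of the margin
              [1.6, 1.6],    # correct side, but inside the margin
              [1.0, 1.0]])   # misclassified
y = np.array([1, 1, 1])

# Slack (hinge loss) per observation: xi_i = max(0, 1 - y_i (w.x_i + b)).
xi = np.maximum(0.0, 1 - y * (X @ w + b))
print(xi)   # [0.  0.8 2. ]
```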
With this slack variable, we now have a new constraint: observations with label $y_i = +1$ should satisfy $\mathbf{w}^\top \mathbf{x}_i + b \geq 1 - \xi_i$, and observations with label $y_i = -1$ should satisfy $\mathbf{w}^\top \mathbf{x}_i + b \leq -(1 - \xi_i)$.
This soft-margin constraint can be compactly written as:

$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i$$
Notice that our constraint is now loosened by the misclassification distance $\xi_i$.
Since we now have two things to minimize, the original objective and the sum of the slack variables, we solve the following minimization problem:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \;\; \forall i$$

where $C > 0$ is a hyperparameter.
This $C$ controls how heavily the slacks are penalized (a small demo follows the list below):
- With a huge $C$, we are essentially solving a hard-margin problem.
  - Because, in an attempt to minimize the slack sum with a huge $C$, we will end up allowing only slacks with near-zero values, which is as rigid as a hard margin.
- With a small $C$, we are giving more room for slack during the minimization.
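A small sketch of this trade-off, using scikit-learn's `SVC` on made-up, slightly overlapping blob data (the exact numbers depend on the data):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping blobs (hypothetical toy data).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

# Smaller C -> wider margin and more slack (more support vectors);
# larger C -> narrower margin, closer to a hard-margin fit.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_)
    print(f"C={C:>6}: margin width = {margin:.3f}, "
          f"support vectors = {clf.n_support_.sum()}")
```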
SVC as an Inner Product
The SVM optimization problem can be solved with the Lagrange multiplier method, which leads to an inner product representation of the SVC.
Long story long
We have the objective function to minimize and a constraint function.
The objective function is bound to the constraint, meaning it’s gotta touch the constraint function at some point.
At this tangent point, the gradient of the objective function and the gradient of the constraint function are parallel (may be different directions and magnitudes, but on a same line).
A Lagrange multiplier is a scalar (one per constraint) that accounts for the difference in magnitude and direction between these two parallel gradients.
The objective, constraint, and the Lagrange multiplier are combined to form a Lagrangian function, and we try to minimize this Lagrangian.
For the optimal solution, just like any minimization, we just solve for when the gradient of the Lagrangian is zero.
This Lagrangian is called the primal form of the optimization problem, or primal problem. There is actually a dual form of the optimization problem.
This is a very rough idea of what’s going on.
The objective function and the multipliers kind of fight each other just like models with regularization fight between the objective and the penalty.
In the primal sense, we try to minimize the Lagrangian by minimizing over original variables. But we could achieve the same by maximizing over the multipliers.
Why the hell do we do this? Because it turns out the dual form of the linear support vector classifier can be computed simply with inner products of the observations $\langle \mathbf{x}_i, \mathbf{x}_{i'} \rangle$.
The classifier can also be expressed as:

$$f(\mathbf{x}) = b + \sum_{i \in \mathcal{S}} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle$$

where $\mathcal{S}$ is the set of support vectors and $\alpha_i$ are the estimated Lagrange multipliers, which are nonzero only for the support vectors.
This is great computationally, but also essential for the kernel trick that we will see in the next section.
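As a sketch of this inner product form: a fitted scikit-learn `SVC` stores `dual_coef_` (which already holds the products $\alpha_i y_i$) along with the support vectors and the bias, so the decision function can be recomputed with plain dot products. The blob data here is made up:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for each support vector, so the decision
# function is just inner products with the support vectors plus the bias.
x_new = X[:5]
manual = x_new @ clf.support_vectors_.T @ clf.dual_coef_.ravel() + clf.intercept_[0]

print(np.allclose(manual, clf.decision_function(x_new)))   # True
```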
Kernel trick
Another way to solve non-linearly separable classification is to use the kernel trick to transform the data. It is commonly used in conjunction with the soft-margin SVC.
Let the feature map $\varphi: \mathbb{R}^p \to \mathbb{R}^q$ (with $q > p$) map the input space to a higher-dimensional feature space.
Cover's theorem
Cover’s theorem states that a linearly non-separable dataset in a lower-dimensional space may be linearly separable in a higher-dimensional space.
The kernel $K$ is then defined as:

$$K(\mathbf{x}_i, \mathbf{x}_{i'}) = \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_{i'}) \rangle$$

In other words, the kernel function is a function that, given any two vectors in the input space, computes their dot product in the feature space.
Loose intuition
In the usual input space, the optimal solution given by the Lagrange multiplier method involves calculating the dot product of the observations $\langle \mathbf{x}_i, \mathbf{x}_{i'} \rangle$.
Since we now inflate the input space to a higher-dimensional feature space, we need to calculate the dot product of the observations in the feature space.
So in order to replace the inner product of inputs, we need to find a kernel that can compute the dot product of the observations in the feature space for all possible pairs of observations.
One key point is that as long as we can find a kernel with output that lives in the inner product space, we don’t have to have an explicitly defined feature map $\varphi$.
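For intuition, here is a small sketch with the degree-2 polynomial kernel in 2D (an example kernel chosen for illustration): the kernel $(\mathbf{x}^\top \mathbf{z})^2$ computed in the input space matches the dot product of an explicit feature map, without ever constructing $\varphi(\mathbf{x})$:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """The same quantity computed directly in the input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Same number, but poly_kernel never builds the higher-dimensional vectors.
print(phi(x) @ phi(z))      # 16.0
print(poly_kernel(x, z))    # 16.0
```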
Radial basis function (RBF) kernel
The most commonly used kernel for SVC is the radial basis function (RBF) kernel.
The RBF kernel is defined as:

$$K(\mathbf{x}_i, \mathbf{x}_{i'}) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_{i'}\|^2}{2\sigma^2}\right)$$
Oftentimes, we define $\gamma = \frac{1}{2\sigma^2}$. Again, this is just a matter of notation, so that the RBF kernel is defined as:

$$K(\mathbf{x}_i, \mathbf{x}_{i'}) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_{i'}\|^2\right)$$

where $\gamma > 0$ is a hyperparameter that controls the flexibility of the decision boundary.
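A minimal sketch of the RBF kernel in NumPy, checked against scikit-learn's `rbf_kernel` (the inputs and $\gamma$ here are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])

print(rbf(x, z, gamma=0.5))                                              # exp(-2.5)
print(rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=0.5)[0, 0])   # same value
```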
Intuition
When two observations are close to each other, the sum of squared differences will be small, making the entire kernel value larger.
When two observations are far from each other, the sum of squared differences will be large, making the entire kernel value smaller.
The SVC with the inner product replaced by the RBF kernel is represented as:

$$f(\mathbf{x}) = b + \sum_{i \in \mathcal{S}} \alpha_i y_i \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}_i\|^2\right)$$

If the new observation $\mathbf{x}$ is far from a support vector $\mathbf{x}_i$, the kernel value is close to zero, so that support vector contributes little to the classification.
So the RBF kernel shrinks the contribution of the support vectors as the observation moves away from them.
- Low $\gamma$: decision boundary is less flexible, more linear.
- High $\gamma$: decision boundary is more flexible (a small demo follows below).
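A small sketch of the effect of $\gamma$, using scikit-learn's `SVC` with the RBF kernel on the made-up `make_moons` dataset (the exact numbers will vary):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A non-linearly separable toy dataset.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Low gamma -> smoother, more linear boundary; high gamma -> very flexible boundary
# that can fit (and overfit) the training data.
for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    print(f"gamma={gamma:>6}: train accuracy = {clf.score(X, y):.2f}, "
          f"support vectors = {clf.n_support_.sum()}")
```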
Support Vector Regression
Support vector regression (SVR) is a regression model using a similar concept to the soft-margin SVC.
What we do is try to fit a hyperplane that minimizes the residuals, while allowing some slack.
In soft margin SVC, we had the hinge loss to define the slack variable.
In SVR, we use the $\varepsilon$-insensitive loss to define the slack:

$$\xi_i = \max\left(0, \; |y_i - f(\mathbf{x}_i)| - \varepsilon\right)$$

where $\varepsilon > 0$ is a hyperparameter that defines a tube around the regression hyperplane.
While the hinge loss penalizes any deviation from the correct side of the margin, the $\varepsilon$-insensitive loss ignores residuals smaller than $\varepsilon$ and penalizes only the amount by which a residual exceeds $\varepsilon$.
We are insensitive to the residuals within the $\varepsilon$-tube, hence the name.
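A minimal sketch of SVR with scikit-learn on made-up noisy data: a wider $\varepsilon$-tube (larger `epsilon`) leaves fewer observations outside the tube, and hence fewer support vectors:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1D regression toy data (hypothetical).
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# Residuals smaller than epsilon are ignored; only larger deviations become slack.
for epsilon in (0.01, 0.1, 0.5):
    reg = SVR(kernel="rbf", C=1.0, epsilon=epsilon).fit(X, y)
    print(f"epsilon={epsilon}: support vectors = {len(reg.support_)}")
```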