# Support Vector Machines

Week 5 | Lesson 5.1

Note: This lesson is a bonus/optional component for Week 5. Feel free to swap this out for additional review & project work as needed.

### LEARNING OBJECTIVES

*After this lesson, you will be able to:*

- Describe what a Support Vector Machine (SVM) model is
- Explain the math that powers it
- Evaluate pros/cons compared to other models
- Tune its hyperparameters

### STUDENT PRE-WORK

*Before this lesson, you should already be able to:*

- Perform regression
- Perform regularization

### INSTRUCTOR PREP

*Before this lesson, instructors will need to:*

- Read in / Review any dataset(s) & starter/solution code
- Generate a brief slide deck
- Prepare any specific materials
- Provide students with additional resources

### LESSON GUIDE

| TIMING | TYPE | TOPIC |
|---|---|---|
| 5 mins | Opening | Opening |
| 45 mins | Introduction | Introduction: Support Vector Machines |
| 10 mins | Demo | Demo: Linear SVM |
| 20 mins | Guided practice | Guided Practice: Tuning an SVM |
| 5 mins | Conclusion | Conclusion |

## Opening (5 mins)

Today we will learn about Support Vector Machines.

**Check:** What do you think the name means?

## Introduction: Support Vector Machines (45 mins)

A *support vector machine* (SVM) is a binary linear classifier whose decision boundary is explicitly constructed to minimize generalization error.

Recall:

- Binary classifier – solves a two-class problem
- Linear classifier – creates a linear decision boundary (a line in 2D, a hyperplane in higher dimensions)

The decision boundary is derived using geometric reasoning (as opposed to the algebraic reasoning we’ve used to derive other classifiers). The generalization error is equated with the geometric concept of margin, which is the region along the decision boundary that is free of data points.

The goal of an SVM is to create the linear decision boundary with the largest margin. This is commonly called the *maximum margin hyperplane* (MMH).

Nonlinear applications of SVM rely on an implicit (nonlinear) mapping that sends vectors from the original feature space K into a higher-dimensional feature space K’. Nonlinear classification in K is then obtained by creating a linear decision boundary in K’. In practice, this involves no computations in the higher dimensional space, thanks to what is called the *kernel trick*.

### Decision Boundary

The decision boundary (MMH) is derived from the discriminant function:

$$ f(x) = w^T x + b $$

where `w` is the weight vector and `b` is the bias. The sign of `f(x)` determines the (binary) class label of a record `x`.
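As a concrete sketch (borrowing scikit-learn's `SVC`, which the demo below introduces), we can recover `w` and `b` from a fitted linear SVM and check that the sign of `f(x)` reproduces the predicted labels. The restriction to two iris classes is an illustrative assumption, not part of the lesson's demo:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Keep only two (linearly separable) iris classes for a binary problem
iris = load_iris()
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]

model = SVC(kernel='linear').fit(X, y)

# w and b as learned by the model
w, b = model.coef_[0], model.intercept_[0]

# f(x) = w^T x + b; its sign determines the class
f = X @ w + b
pred = (f > 0).astype(int)

# The sign rule agrees with the model's own predictions
agree = bool((pred == model.predict(X)).all())
```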

As we said before, SVM solves for the decision boundary that minimizes generalization error, or equivalently, that has the maximum margin. These are equivalent since using the MMH as the decision boundary minimizes the probability that a small perturbation in the position of a point produces a classification error.

Selecting the MMH is a straightforward exercise in analytic geometry (we won't go through the details here). In particular, this task reduces to the optimization of the following convex objective function:

$$\text{minimize: } \space \frac{1}{2}||w||^2$$

$$\text{subject to: } y_i(w^T x_i + b) \geq 1 \text{ for } i = 1,..,N$$

Notice that the margin depends only on a subset of the training data; namely, those points that are nearest to the decision boundary. These points are called the *support vectors*. The other points (far from the decision boundary) don't affect the construction of the MMH at all.

This formulation only works if the two classes are linearly separable, so that we can indeed find a margin to separate them. Usually, however, classes are not separable, and there is partial overlap between them. This requires an extension of the formulation to accommodate class overlap.

### Soft margin, slack variables

Class overlap is accommodated by relaxing the minimization problem, i.e. by softening the margin. This amounts to solving the following problem:

$$ \text{minimize: } \space \frac{1}{2}||w||^2 + C \sum_{i=1}^N \xi_i $$

$$ \text{subject to: } y_i(w^T x_i + b) \geq 1 - \xi_i \text{ for } i = 1,..,N \text{ and } \xi_i \geq 0 $$

The hyperparameter `C` (soft-margin constant) controls the overall complexity by specifying the penalty for training error. This is yet another example of *regularization*.
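To see the soft-margin constant in action, here is a small sketch (anticipating the scikit-learn demo below): shrinking `C` softens the margin, so more training points typically end up inside it as support vectors. The specific values 0.01 and 100 are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Count support vectors at a soft (small C) and a hard (large C) setting
n_sv = {C: int(SVC(kernel='linear', C=C).fit(X, y).n_support_.sum())
        for C in (0.01, 100.0)}

# The softer margin sweeps up at least as many support vectors
soft_has_more = n_sv[0.01] >= n_sv[100.0]
```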

### Nonlinear SVM

The soft-margin optimization problem can be rewritten as:

$$ \text{maximize: } \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N y_i y_j \alpha_i \alpha_j x_i^T x_j $$

$$ \text{subject to: } \sum_{i=1}^N y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C $$

Since the feature vector $x$ only appears in the inner product, we can replace this inner product with a more general function that has the same type of output as the inner product. This is called the *kernel trick*.

Formally, we can think of the inner product as a map that sends two vectors in the feature space K into the real line $R$. A kernel function likewise sends two vectors in K into $R$, but its value equals the inner product of their images in a higher-dimensional feature space K', without ever computing that mapping explicitly.


#### Some popular kernels

- Linear kernel: $ k(x, x') = x^T x' $
- Polynomial kernel: $ k(x, x') = (x^T x' + 1)^d$
- Gaussian kernel (rbf): $ k(x, x') = \exp{(-\gamma||x - x'||^2)} $

The hyperparameters $d$ and $\gamma$ affect the flexibility of the decision boundary.
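A quick numeric sketch: each kernel above can be computed directly from its definition, and for the Gaussian kernel the result matches scikit-learn's `rbf_kernel` helper. The vectors and hyperparameter values here are arbitrary examples:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
xp = np.array([[2.0, 0.0]])
gamma = 0.5
d = 3

# Linear kernel: plain inner product
k_lin = float(x @ xp.T)                              # -> 2.0

# Polynomial kernel of degree d
k_poly = float((x @ xp.T + 1) ** d)                  # -> 27.0

# Gaussian (RBF) kernel, straight from the definition
k_rbf = float(np.exp(-gamma * np.sum((x - xp) ** 2)))

# Same value via scikit-learn
k_rbf_sk = float(rbf_kernel(x, xp, gamma=gamma)[0, 0])
```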

## Demo: Linear SVM (10 mins)

Scikit-learn implements support vector machine models in the `svm` package.

```
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
```

```
model = SVC(kernel='linear')
model.fit(X, y)
```

```
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
```

Notice that the `SVC` class has several parameters. In particular, we are concerned with two:

- `C`: penalty parameter of the error term (regularization)
- `kernel`: the type of kernel used (`linear`, `poly`, `rbf`, `sigmoid`, `precomputed`, or a callable)

Notes from the documentation:

- In the current implementation the fit time complexity is more than quadratic in the number of samples, which makes it hard to scale to datasets with more than a couple of tens of thousands of samples.
- The multi-class support is handled according to a one-vs-one scheme.
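The one-vs-one scheme is easy to see directly (a sketch on the same iris data used below): with 3 classes, `decision_function` in `'ovo'` mode returns one column per pair of classes, i.e. 3·2/2 = 3 columns:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()

# 'ovo' exposes the raw pairwise classifiers: n*(n-1)/2 columns for n classes
clf = SVC(kernel='linear', decision_function_shape='ovo').fit(iris.data, iris.target)
n_pairwise = clf.decision_function(iris.data[:1]).shape[1]
```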

As usual, we can calculate the cross-validated score to judge the quality of the model.

```
from sklearn.model_selection import cross_val_score

cvscores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
print("CV score: {:.3} +/- {:.3}".format(cvscores.mean(), cvscores.std()))
```

```
CV score: 0.98 +/- 0.0163
```

## Guided Practice: Tuning an SVM (20 minutes)

An SVM almost never works well without tuning its parameters.

**Check:** Try performing a grid search over kernel type and regularization strength to find the optimal score for the above data.

Answer:

```
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 3, 10]}
clf = GridSearchCV(model, parameters, n_jobs=-1)
clf.fit(X, y)
clf.best_estimator_
```

**Check:** Can you think of pros and cons for Support Vector Machines?

Pros:

- Very powerful, good performance
- Can be used for anomaly detection (one-class SVM)

Cons:

- Can get very hard to train with lots of data
- Prone to overfitting (needs regularization)
- Black box
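As a sketch of the anomaly-detection use mentioned above, scikit-learn's `OneClassSVM` learns the boundary of the "normal" data and flags points outside it. The synthetic data and the `gamma`/`nu` values here are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)                  # the "normal" cloud around the origin
X_outliers = rng.uniform(4.0, 6.0, (5, 2))   # points far from that cloud

# predict() returns +1 for inliers and -1 for outliers
oc = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.05).fit(X_train)
flags = oc.predict(X_outliers)

all_flagged = bool((flags == -1).all())
```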

## Conclusion (5 mins)

In this class we have learned about Support Vector Machines. We've seen how they are powerful in many situations and what some of their limitations can be.

Can you think of a way to apply them in business?