I have been trying to understand the difference between a regular Support Vector Machine and a kernel Support Vector Machine. I have my own intuition, but I'm not sure if it is quite right. Below is my understanding in my own words; is this correct?

So my understanding is that with a regular SVM, you are trying to find a hyperplane which separates the data into classes, based on their ground-truth labels. The problem involves optimising the parameters of the separating hyperplane, such that the distance between the hyperplane and the closest data point of each class is maximized. In this context, the "distance" is calculated in the original space of the data.

However, if the data is not linearly-separable, then we need to find a new space in which the data is linearly-separable, so that the "distance" from before can be calculated in this new space. You could just introduce some arbitrary new dimensions, e.g. (x1 * x2), (x1 * x1), etc., but there is no guarantee that this new space is appropriate. So instead, you use a kernel to compute the new space, which takes into account the actual locations of the datapoints, rather than just the space as a whole.

A kernel takes in two datapoints, and computes a scalar value which effectively describes the "similarity" of the two datapoints. Then, for each datapoint, the kernel is applied between this datapoint and every single other datapoint in the dataset. This datapoint can now be described by a vector of "similarities", one element for each of the other datapoints. In this new space, it may now be easier to separate out the data, because the new space actually represents similarities between datapoints, whereas in the original space, it just represented the real-world data.

And the reason why this new space is considered to have potentially infinite dimensions, is that you can have one dimension of the new space for each datapoint in your data; so as your dataset approaches infinite size, the dimensionality of the new dataset also approaches infinity.

Am I thinking about this in the right way?

Specifically, I'm not sure about the following:

1) My understanding of why the kernel SVM space is better than just introducing arbitrary new dimensions (because the kernel method actually considers the relationships between datapoints).

2) Whether my understanding of how the optimal separating hyperplane is calculated in the new space is correct (by taking a datapoint and computing its similarity to every other datapoint, and then creating a vector with one similarity value for each of these other datapoints).

3) My understanding of why the new space can be thought of as having virtually infinite dimensions (because the dataset can have a virtually infinite number of datapoints, and therefore a virtually infinite number of "similarities" in the vector which describes the new space).


#### Best Answer

Your understanding of linear SVMs sounds correct, but there may be some misconceptions about kernelized SVMs.

A kernelized SVM is equivalent to a linear SVM that operates in feature space rather than input space. Conceptually, you can think of this as mapping the data (possibly nonlinearly) into feature space, then using a linear SVM. However, the actual steps taken when using a kernelized SVM don't look like this because the kernel trick is used. Rather than explicitly mapping the data into feature space, the mapping is defined implicitly by the kernel function, which returns the dot product between feature space representations of two data points. Say $x$ and $x'$ are points in input space and $K$ is the kernel function. Then $K(x, x') = \Phi(x) \cdot \Phi(x')$, where $\Phi$ is a mapping from input space to feature space. Because a linear SVM can be formulated in terms of dot products, one can replace these dot products with kernel function evaluations to obtain a linear SVM that operates in feature space, without ever having to compute (or even know) $\Phi$.

> You could just introduce some arbitrary new dimensions…but there is no guarantee that this new space is appropriate. So instead, you use a kernel to compute the new space, which takes into account the actual locations of the datapoints, rather than just the space as a whole.

Feature space isn't defined by the data points, but by the kernel function itself. There's no guarantee that any particular kernel function will give linear separability. So, the choice of kernel is an important model selection problem.

> In this new space, it may now be easier to separate out the data, because the new space actually represents similarities between datapoints, whereas in the original space, it just represented the real-world data.

Consider the linear kernel $K(x, x') = x \cdot x'$, which simply computes the dot product in input space. This gives a feature space that's equivalent to input space (up to rotation and reflection, which preserve the dot product). So, an SVM with this kernel would be equivalent to a regular linear SVM, and the above statement can't be true: just framing things in terms of similarities between data points doesn't necessarily increase separability.
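To make this concrete, here is a small sketch (in numpy, with made-up dual coefficients rather than a trained model) showing that with the linear kernel, the kernelized decision function $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ collapses to an ordinary linear decision function $w \cdot x + b$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 "support vectors" in 3-D input space
y = np.array([1, -1, 1, 1, -1])  # their labels
alpha = rng.uniform(size=5)      # hypothetical dual coefficients, not trained
b = 0.5

def f_kernel(x):
    # Kernelized form: a weighted sum of linear-kernel "similarities"
    # between x and each support vector.
    return np.sum(alpha * y * (X @ x)) + b

# Equivalent primal weight vector: w = sum_i alpha_i * y_i * x_i
w = (alpha * y) @ X

x_new = rng.normal(size=3)
assert np.isclose(f_kernel(x_new), w @ x_new + b)
```

Since the two forms are identical for the linear kernel, the "similarity" framing alone adds nothing; any extra separability must come from the kernel inducing a genuinely different feature space.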

> And the reason why this new space is considered to have potentially infinite dimensions, is that you can have one dimension of the new space for each datapoint in your data; so as your dataset approaches infinite size, the dimensionality of the new dataset also approaches infinity.

Feature space does *not* have one dimension per datapoint. For example, consider the linear kernel again. Feature space is equivalent to the input space, and therefore has the same number of dimensions. Polynomial kernels induce a finite dimensional feature space, with higher dimensionality than the input space (for degree > 1). Some kernels (such as the RBF kernel) do induce an infinite dimensional feature space. This just means that, in order for $\Phi(x) \cdot \Phi(x')$ to equal $K(x, x')$, $\Phi(x)$ must be infinite dimensional. This is a consequence of the kernel function, rather than the number of data points. However, even if feature space is infinite dimensional, the data can only span a finite-dimensional subspace. The maximum possible dimensionality of this subspace is the number of data points.
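This last point can be sketched numerically: an SVM only ever sees the $n \times n$ Gram matrix of pairwise kernel evaluations, and that matrix's rank is at most $n$, even when the kernel's feature space is infinite dimensional. The snippet below (a numpy sketch, with an RBF kernel implemented by hand) illustrates this:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    # Gram matrix G[i, j] = exp(-gamma * ||x_i - x_j||^2), i.e. pairwise
    # RBF kernel evaluations, computed from squared distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))   # 10 points in 2-D input space
G = rbf_gram(X)

# Although the RBF feature space is infinite dimensional, the 10 mapped
# points span a subspace of dimension at most 10.
rank = np.linalg.matrix_rank(G)
print(rank)  # at most 10
```

For distinct points the RBF Gram matrix is typically full rank, so the spanned subspace has dimension equal to the number of data points, which is exactly the bound stated above.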

### Similar Posts:

- Solved – What are kernels in support vector machine?
- Solved – How does linear SVMs function in multi dimensional feature space
- Solved – Is a polynomial kernel ridge regression really equivalent to performing a linear regression on those expanded features
- Solved – Dimension of weight vectors in SVMs