PCA is used for dimensionality reduction. I learned today that PCA cannot capture nonlinear structure in data; when the data is nonlinear, you have to use kernel PCA (KPCA).

Since KPCA is applicable to a wider variety of data, it seems like a superior method, apart from its longer computation time. Is KPCA truly superior to PCA, or are there practical disadvantages to using KPCA when the data is linear?

It seems that if computational efficiency isn't a big concern, I could just almost always naively use KPCA.


Kernel PCA (kPCA) actually includes regular PCA as a special case: they're equivalent if the linear kernel is used. In general, though, they have different properties. Here are some points of comparison:
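The equivalence in the linear-kernel case can be checked numerically. A minimal sketch with NumPy on synthetic data (the dataset and sizes are illustrative): PCA projects onto eigenvectors of the d×d scatter matrix, while linear-kernel kPCA uses eigenvectors of the n×n Gram matrix, and the resulting embeddings agree up to a sign per component.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Xc = X - X.mean(axis=0)  # center the data

# Linear PCA: top eigenvectors of the d x d scatter matrix X^T X
evals, evecs = np.linalg.eigh(Xc.T @ Xc)
pca_proj = Xc @ evecs[:, ::-1][:, :2]  # top-2 components

# Linear-kernel kPCA: top eigenvectors of the n x n Gram matrix X X^T
kvals, kvecs = np.linalg.eigh(Xc @ Xc.T)
# kPCA embeddings are Gram eigenvectors scaled by sqrt(eigenvalue)
kpca_proj = kvecs[:, ::-1][:, :2] * np.sqrt(kvals[::-1][:2])

# The two embeddings agree up to a sign flip per component
print(np.allclose(np.abs(pca_proj), np.abs(kpca_proj)))  # → True
```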

• Linear vs. nonlinear structure. kPCA can capture nonlinear structure in the data (if using a nonlinear kernel), whereas PCA cannot. This can be a big deal; if this capability is needed, PCA simply isn't an option and kPCA (or one of many other nonlinear dimensionality reduction techniques) must be used. If the data truly live on a linear manifold, kPCA can do no better than PCA. In fact, using a nonlinear kernel may give worse performance due to overfitting.
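The classic illustration of this point is two concentric circles, a nonlinear structure no linear projection can separate. A sketch using scikit-learn (the gamma value and classifier choice are illustrative): a linear classifier in the PCA embedding is near chance, while the RBF-kPCA embedding makes the classes linearly separable.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: labels cannot be separated by any linear projection
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

Z_lin = PCA(n_components=2).fit_transform(X)
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Fit a linear classifier in each embedding and compare training accuracy
acc_lin = LogisticRegression().fit(Z_lin, y).score(Z_lin, y)
acc_rbf = LogisticRegression().fit(Z_rbf, y).score(Z_rbf, y)
print(acc_lin, acc_rbf)  # PCA near chance, RBF kPCA near perfect
```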

• Interpretability. PCA provides a set of weights that define a linear mapping from the input space to the low-dimensional embedding space. These weights can provide useful information about the structure of the data. No such weights exist for kPCA with a nonlinear kernel: the mapping is nonlinear, and it's also nonparametric, defined implicitly through the training points rather than by a fixed set of coefficients.
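A small sketch of what those weights look like, using scikit-learn on hypothetical data where one feature dominates the variance: the loadings in `components_` directly reveal which input features drive each component, something kPCA with a nonlinear kernel cannot offer.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Illustrative data: feature 0 carries most of the variance
X = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 1.0])

pca = PCA(n_components=1).fit(X)
# components_ has shape (n_components, n_features): the linear weights.
# The first component loads almost entirely on feature 0.
print(pca.components_[0])
```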

• Inverse mapping. PCA provides an inverse mapping from the low-dimensional space back to the input space. So, input points can be approximately reconstructed from their low-dimensional images. kPCA doesn't inherently provide an inverse mapping, although it's possible to estimate one using additional methods (at the cost of extra complexity and computational resources).
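A sketch of the contrast in scikit-learn (data and parameters are illustrative): PCA's inverse comes for free from the linear weights, while `KernelPCA` only provides one if you ask it to fit an extra regression model via `fit_inverse_transform=True`.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# PCA: built-in approximate inverse, just project back through the weights
pca = PCA(n_components=2).fit(X)
X_back = pca.inverse_transform(pca.transform(X))

# kPCA: no inherent inverse; scikit-learn estimates a pre-image by
# fitting an additional regression model (extra cost and complexity)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1,
                 fit_inverse_transform=True).fit(X)
X_back_k = kpca.inverse_transform(kpca.transform(X))

print(X_back.shape, X_back_k.shape)  # both (100, 5)
```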

• Hyperparameters. Both methods require choosing the number of dimensions. kPCA also requires choosing the kernel function and any associated parameters (e.g. the bandwidth of an RBF kernel, or degree of a polynomial kernel). This choice (and the method/criterion employed to make it) is problem dependent. Typically, one needs to re-fit kPCA multiple times to compare different kernel/parameter choices.
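The re-fitting loop might look like the following sketch, using in-sample reconstruction error as one possible (assumed) selection criterion; in practice the criterion is problem dependent and held-out data would be preferable.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # illustrative data

# Re-fit kPCA for each candidate bandwidth and compare reconstruction error
errors = {}
for gamma in (0.01, 0.1, 1.0, 10.0):
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True).fit(X)
    X_back = kpca.inverse_transform(kpca.transform(X))
    errors[gamma] = float(np.mean((X - X_back) ** 2))

best_gamma = min(errors, key=errors.get)
print(best_gamma, errors[best_gamma])
```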

• Computational cost. PCA generally has lower memory and runtime requirements than kPCA, and can be scaled to massive datasets. Various strategies exist for scaling up kPCA, but this requires making approximations (e.g. see the Nystroem approximation).
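As a sketch of the Nystroem approach in scikit-learn (landmark count and gamma are illustrative): an explicit approximate kernel feature map is built from a subset of landmark points, after which cheap linear PCA on those features approximates full kPCA without forming the n×n kernel matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))  # illustrative data

# Nystroem builds an explicit approximate RBF feature map from 100
# landmark points; linear PCA on those features approximates RBF kPCA
approx_kpca = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0),
    PCA(n_components=2),
)
Z = approx_kpca.fit_transform(X)
print(Z.shape)  # (1000, 2)
```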

As usual, the right tool for the job depends on the problem.
