I understand the math behind kernel PCA with the RBF kernel, and I also understand that the RBF kernel maps the data into an infinite-dimensional space.

I know that for SVMs, mapping the data into a higher-dimensional space will always make it easier to linearly separate the data from different classes.

But what I don't understand is why mapping data into a higher-dimensional space (say infinite-dimensional, via the RBF kernel) makes it more suitable to apply linear PCA in the feature space. Is the performance always guaranteed to improve compared to PCA applied to nonlinear data?

I tried to visualize the process, but it is hard for me with the RBF kernel. I don't physically or intuitively see why mapping the data into a higher-dimensional space will always work for PCA purposes (for nonlinear data). Is there any concrete example or explanation of what actually happens (not just an illustrative plot) and why it helps to use the RBF kernel? And is there any way to characterize how the data is distributed in the feature space under the RBF kernel?

Thanks!

Note: I know that the theory and applications are well established for classification purposes, but not for KPCA as a dimensionality-reduction (or distribution-modeling) technique, and that is the main aim of this question.


#### Best Answer

> But what I don't understand is why mapping data into a higher-dimensional space (say infinite-dimensional, via the RBF kernel) makes it more suitable to apply linear PCA in the feature space. Is the performance always guaranteed to improve compared to PCA applied to nonlinear data?

It can be useful if you're interested in classification with a linear method, or in clustering – in the higher-dimensional space the data might be easier to separate linearly, or the clusters might be easier to tell apart. For a simplified, theoretical version of this statement, you can look at Cover's theorem.
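A minimal sketch of this effect (the concentric-circles dataset, `gamma=10`, and logistic regression as a separability probe are my illustrative choices, assuming scikit-learn is available): linear PCA merely rotates the circles, while the leading RBF-KPCA components make the two classes nearly linearly separable.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: not linearly separable in the input space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Project onto the first two components in input space vs. RBF feature space.
z_pca = PCA(n_components=2).fit_transform(X)
z_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# A linear classifier on each 2-D projection measures how linearly
# separable the classes are after the transform.
acc_pca = LogisticRegression().fit(z_pca, y).score(z_pca, y)
acc_kpca = LogisticRegression().fit(z_kpca, y).score(z_kpca, y)
print(acc_pca, acc_kpca)  # KPCA projection is far easier to separate linearly
```

Note that this is a best case for the RBF kernel; with a poorly chosen `gamma` the KPCA projection can be no better than linear PCA.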

> I tried to visualize the process, but it is hard for me with the RBF kernel.

That's not really surprising, considering that the embedding space is infinite-dimensional.
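That said, the geometry of the embedded data can be characterized without ever forming the feature map explicitly, which partly answers your last question. For the RBF kernel, k(x, x) = 1, so every point is mapped onto the unit sphere in feature space, and squared feature-space distances are ||φ(x) − φ(y)||² = 2 − 2k(x, y), so they are bounded by 2; also k(x, y) > 0, so all images lie within a single orthant of that sphere. A small sketch (the data here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # arbitrary data: 5 points in 3-D
gamma = 0.5

def rbf(x, y):
    return np.exp(-gamma * np.sum((x - y) ** 2))

# ||phi(x)||^2 = k(x, x) = 1: every point lands on the unit sphere.
norms = np.array([rbf(x, x) for x in X])

# ||phi(x) - phi(y)||^2 = k(x,x) + k(y,y) - 2 k(x,y) = 2 - 2 k(x,y),
# always in (0, 2) since 0 < k(x, y) <= 1.
d2 = 2 - 2 * rbf(X[0], X[1])
print(norms, d2)
```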

> I don't physically or intuitively see why mapping the data into a higher-dimensional space will always work for PCA purposes (for nonlinear data).

That's because it doesn't have to work well – I remember once trying polynomial kernels with SVMs on a problem that should in theory be solvable with them (text classification; polynomial kernels act somewhat like boolean conjunctions on one-hot encoded vectors), but in practice the method performed worse than a linear kernel.
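To make the "boolean expressions" remark concrete (the toy vectors below are my own example): a homogeneous degree-2 polynomial kernel on binary bag-of-words vectors implicitly computes the features x_i·x_j, which are exactly ANDs of word-presence indicators:

```python
import numpy as np

# Two toy binary bag-of-words vectors (presence/absence of 4 "words").
x = np.array([1, 0, 1, 1])
y = np.array([1, 1, 1, 0])

# Explicit degree-2 feature map: all pairwise products x_i * x_j,
# i.e. boolean ANDs "word i AND word j are both present".
phi = lambda v: np.outer(v, v).ravel()

implicit = (x @ y) ** 2     # kernel trick: (x . y)^2
explicit = phi(x) @ phi(y)  # same value via the explicit feature map
print(implicit, explicit)   # → 4 4
```

Having such expressive features available in principle does not mean the learner will find a good hypothesis among them – which is why the polynomial kernel could still lose to the linear one.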

> Is there any concrete example or explanation of what actually happens (not just an illustrative plot) and why it helps to use the RBF kernel?

I don't know whether it's possible to formulate this for PCA, but for classification with SVMs there exist estimates of the VC dimension (which, roughly, measures how complex a decision function the classifier can represent). Some example facts can be found in these slides (see "VC Dimension of Support Vector Machines").