Solved – Any situation where PCA performs better than SVD?

It's for a text clustering application. There are around 25 documents and 50k features (from TF-IDF), so I was expecting SVD to be a better choice.

I am using sklearn's PCA and TruncatedSVD functions before I cluster the data. I am using Hierarchical Clustering (ward linkage), also from sklearn.
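For context, the pipeline looks roughly like this (the toy documents and the parameter choices `n_components=2`, `n_clusters=3` are illustrative assumptions, not from the question):

```python
# Sketch of the described pipeline: TF-IDF -> PCA/TruncatedSVD -> Ward clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.cluster import AgglomerativeClustering

docs = ["apple banana fruit", "banana orange fruit",
        "dog cat pet", "cat hamster pet",
        "car truck vehicle", "truck bus vehicle"]

X = TfidfVectorizer().fit_transform(docs)   # sparse (n_docs, n_features)

# PCA needs a dense, mean-centered matrix; TruncatedSVD accepts sparse input.
X_pca = PCA(n_components=2).fit_transform(X.toarray())
X_svd = TruncatedSVD(n_components=2).fit_transform(X)

ward = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels_pca = ward.fit_predict(X_pca)
labels_svd = ward.fit_predict(X_svd)
```

Note the asymmetry already visible here: PCA forces a `.toarray()` densification, while TruncatedSVD works directly on the sparse TF-IDF matrix.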

PCA seems to result in more sensible clusters. Since this is unsupervised I can't be entirely sure, but when I visualise the results, the clusters formed after PCA look tighter and further apart.

PCA can be implemented as SVD on the covariance matrix.
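Concretely: the principal components are the eigenvectors of the covariance matrix, which are exactly the right singular vectors of the mean-centered data matrix. A small NumPy check on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 8))            # 25 samples, 8 features

Xc = X - X.mean(axis=0)                 # mean-center the data

# Route 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
eigvecs = eigvecs[:, ::-1]              # reorder: largest variance first

# Route 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Same directions, up to sign flips
for k in range(3):
    assert np.allclose(np.abs(Vt[k]), np.abs(eigvecs[:, k]), atol=1e-8)
```

The singular values also relate directly to the eigenvalues: `S**2 / (n - 1)` reproduces the covariance eigenvalues.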

SVD is more general, and can also e.g. be applied to the distance or similarity matrix.

If you have traditional point data from continuous distributions in Euclidean spaces, then PCA will usually work better. In particular, the results are much more interpretable.

On "encoded" data that is not of numerical nature (such as text), neither PCA nor SVD makes a lot of sense numerically. SVD makes a little bit more sense than PCA (covariance is pretty much nonsense on sparse data), but will usually not yield to any major benefits. You can make it work, but it's hard to draw many benefits from this. If you are lucky, you may at least get out a usable visualization, that's all.

In your case (25 documents), SVD on the distance matrix (O(n³), or O(n²) truncated) will be much cheaper than PCA (O(d³), or O(d²) truncated, with d = 50,000). But on realistic problems with a million instances, SVD on the distance matrix becomes infeasible.
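The size argument is easy to see in code: with n = 25 documents and d = 50,000 features, the n × n similarity matrix is tiny, while a d × d covariance matrix would have 2.5 billion entries before any factorization even starts (the shapes below are assumptions matching the question):

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

n, d = 25, 50_000
X = sparse.random(n, d, density=0.001, format="csr", random_state=0)

# n x n similarity matrix: 625 entries, trivially cheap to factorize
S = cosine_similarity(X)
eigvals = np.linalg.eigvalsh(S)  # O(n^3) on a 25x25 matrix: instant

# By contrast, a d x d covariance matrix would hold 50,000^2 = 2.5e9
# entries (~20 GB as dense float64) before the eigendecomposition begins.
print(S.shape, eigvals.shape)
```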

You said your task is unsupervised; if you ever do have labels, I would try supervised projection techniques instead. Because your data is sparse, make sure to use non-metric approaches, as they make more sense here.

Don't focus on what you can make "run". Focus on what makes sense to use.
