Solved – What are the benefits of semi-supervised learning over unsupervised clustering, and what are its limitations?

I have another question about semi-supervised learning versus unsupervised clustering: what are the benefits and limitations of each?

I have some data with labels and some without. I used semi-supervised learning (with an SVM classifier) for the classification task.

I also compared this with the results of unsupervised clustering (hierarchical clustering). I found that:

For the labeled data, the clustering results are quite similar to the cross-validation performance of the semi-supervised learning.

For the unlabeled data, however, the clustering results are not as good as the predictions of the SVM classifier trained on the labeled data, judging by a qualitative check (visual inspection of the classified images).

How should I interpret these findings? Does this mean that semi-supervised learning is superior to unsupervised learning? And if so, why?


Unless semi-supervised learning fails badly (for example, because of overfitting), its results should be better than those of unsupervised learning, at least under a supervised evaluation.

A method that does not have or use the training labels has little chance against one that knows part of the objective: ignoring the labels literally means ignoring the essential part of the data. A (semi-)supervised method tries to maximize your evaluation measure; an unsupervised method cannot do this, because it never sees the labels. It is unfair to evaluate unsupervised algorithms against supervised ones on a supervised measure.
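To make this concrete, here is a minimal sketch on made-up toy data, not the asker's images. Everything in it is an assumption for illustration: a nearest-centroid classifier with one round of self-training stands in for the semi-supervised SVM, plain k-means stands in for hierarchical clustering, and the helper names (`kmeans`, `fit_centroids`, `predict`) are hypothetical. Feature 0 carries a strong attribute unrelated to the classes; feature 1 carries the weaker signal that actually separates them.

```python
from collections import Counter
from itertools import combinations

# Toy data (an assumption): feature 0 = strong nuisance attribute,
# feature 1 = weak signal that actually determines the class.
points, classes = [], []
for group in (-1, 1):
    for cls in (-1, 1):
        for jitter in (-0.4, -0.2, 0.0, 0.2, 0.4):
            points.append((10.0 * group + jitter, cls + 0.5 * jitter))
            classes.append(cls)

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def nearest(p, centers):
    return min(range(len(centers)), key=lambda i: sqdist(p, centers[i]))

def kmeans(pts, centers, iters=20):
    """Plain Lloyd iterations from the given starting centers."""
    centers = list(centers)
    for _ in range(iters):
        assign = [nearest(p, centers) for p in pts]
        for k in range(len(centers)):
            members = [p for p, a in zip(pts, assign) if a == k]
            if members:  # keep the old center if a cluster runs empty
                centers[k] = mean(members)
    assign = [nearest(p, centers) for p in pts]
    inertia = sum(sqdist(p, centers[a]) for p, a in zip(pts, assign))
    return assign, inertia

# Unsupervised: try every pair of points as a start, keep the tightest result.
clusters, _ = min((kmeans(points, pair) for pair in combinations(points, 2)),
                  key=lambda result: result[1])

# Semi-supervised stand-in: four labeled points, the rest unlabeled.
labeled = [0, 5, 10, 15]
unlabeled = [i for i in range(len(points)) if i not in labeled]

def fit_centroids(idx, labels):
    return {c: mean([points[i] for i, l in zip(idx, labels) if l == c])
            for c in set(labels)}

def predict(centroids, p):
    return min(centroids, key=lambda c: sqdist(p, centroids[c]))

known = [classes[i] for i in labeled]
centroids = fit_centroids(labeled, known)
# One self-training round: pseudo-label all unlabeled points, then refit.
pseudo = [predict(centroids, points[i]) for i in unlabeled]
centroids = fit_centroids(labeled + unlabeled, known + pseudo)
predicted = [predict(centroids, points[i]) for i in unlabeled]

# Supervised evaluation on the unlabeled points only.
truth = [classes[i] for i in unlabeled]
semi_acc = sum(p == t for p, t in zip(predicted, truth)) / len(truth)

# Best case for clustering: name each cluster after its majority class.
votes = {k: Counter(classes[i] for i in unlabeled if clusters[i] == k)
            .most_common(1)[0][0] for k in set(clusters)}
cluster_acc = sum(votes[clusters[i]] == classes[i] for i in unlabeled) / len(unlabeled)

print("semi-supervised accuracy:", semi_acc)
print("clustering accuracy (majority vote):", cluster_acc)
```

On this toy data the clustering splits along the nuisance feature, so even the most generous cluster-to-class mapping does no better than chance, while the label-informed method picks up the weaker feature. Real data will rarely be this clear-cut; the sketch only shows why the comparison is lopsided.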

In fact, for a classification task you have to be very lucky for the clustering results to correspond even roughly to your classes. Most of the time, clustering yields a different structure than the classes you have, because the data may contain more than one interesting structure. Classes may overlap (and thus form a single cluster) or split into more than one cluster. Very often the clusters discovered are "correct" but not "interesting": the algorithms tend to emphasize simple structure (say, gender, which may well already be an attribute in your data) because it is simply stronger than the structure you are looking for.
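One way to check how clusters relate to known classes is to cross-tabulate them on the labeled part of the data. The sketch below uses invented class names and cluster ids purely for illustration; `groups` and `purity` are hypothetical helpers, not functions from any library.

```python
from collections import Counter

# Invented example: "cat" and "dog" end up in one cluster,
# while "bird" is split across two clusters.
classes = ["cat"] * 4 + ["dog"] * 4 + ["bird"] * 4
clusters = [0] * 8 + [1] * 2 + [2] * 2

def groups(keys, values):
    """Map each key to the set of values that co-occur with it."""
    out = {}
    for k, v in zip(keys, values):
        out.setdefault(k, set()).add(v)
    return out

def purity(classes, clusters):
    """Fraction of points that belong to the majority class of their cluster."""
    table = Counter(zip(clusters, classes))
    best = Counter()
    for (clu, _cls), n in table.items():
        best[clu] = max(best[clu], n)
    return sum(best.values()) / len(classes)

# Clusters that contain more than one class (classes merged together).
merged = {clu: cls for clu, cls in groups(clusters, classes).items() if len(cls) > 1}
# Classes that are spread over more than one cluster (class split apart).
split = {cls: clu for cls, clu in groups(classes, clusters).items() if len(clu) > 1}

print("merged:", merged)
print("split:", split)
print("purity:", purity(classes, clusters))
```

A table like this tells you whether the clustering found your classes or some other structure, but it does not tell you whether that other structure is meaningful.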

Clustering is not a replacement for classification. Classification is for prediction; clustering is for data exploration. That is, you will need a domain expert to study the clustering result and decide whether the structure is useful. Do not attempt to put clustering algorithms into "production": you cannot.
