Solved – How to learn labels with unsupervised learning

In a blog post, I came across the sentence:

"Similarly, you can use unsupervised learning to learn labels for your data, then use those labels for supervised learning."

I have never heard of the first part (learning labels with unsupervised learning) before. How exactly do you use unsupervised learning to "learn labels" for unlabeled data?

Normally, you don't (and you don't believe everything someone writes somewhere on the internet).

What the writer probably meant (at least that's my interpretation) is that you can use clustering to identify the clusters, declare each cluster to be a class of its own, and use these "classes" to learn class boundaries or other rules for "classifying" new data.
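To make that interpretation concrete, here is a minimal sketch in plain Python. Everything in it is my own illustrative choice, not something from the blog: the synthetic two-blob data, the helper names `kmeans` and `knn_predict`, the farthest-point initialisation, and the use of k-nearest-neighbours as the supervised step. It only shows the mechanics of "cluster, then treat cluster ids as labels".

```python
import math
import random
from collections import Counter


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns one cluster id per point."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    def assign():
        # Each point gets the index of its nearest centroid.
        return [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assign()


def knn_predict(train_x, train_y, query, k=3):
    """Supervised step: majority vote among the k nearest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda xy: math.dist(xy[0], query))[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]


rng = random.Random(0)
# Two well-separated synthetic blobs (hypothetical data; no true labels are used).
blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
blob_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]
X = blob_a + blob_b

# Unsupervised step: the cluster ids become the "labels".
pseudo_labels = kmeans(X, k=2)

# Supervised step: classify new points using those pseudo-labels.
pred_near_a = knn_predict(X, pseudo_labels, (0.5, -0.5))
pred_near_b = knn_predict(X, pseudo_labels, (9.5, 10.5))
print(pred_near_a, pred_near_b)
```

Note that the cluster ids are arbitrary integers: nothing tells you which id corresponds to which real-world class, or whether the clusters correspond to meaningful classes at all.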

This approach, however, is likely to suffer from severe generalisation issues, if it works at all. If the true classes overlap, clustering won't be able to identify them and the clusters will not correspond to the classes. Even if the clusters/classes are well separated, lack of true labels will prevent you from tuning hyperparameters and ensuring good generalisation. So, it is a theoretically possible concept, but unlikely to work in practice.
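The overlap problem can be illustrated with a small simulation. This is only a sketch under my own assumptions: two one-dimensional Gaussian classes with unit variance, a hand-written 2-means (`two_means_1d`), and an `agreement` score that compares cluster ids to the true classes up to swapping the two ids. The true labels are used solely to score the result, which is exactly what you could not do in the real unlabeled setting.

```python
import random


def two_means_1d(xs, iters=50):
    """Lloyd's algorithm for k=2 on scalars, initialised at the min and max."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]


def agreement(cluster_ids, true_labels):
    """Fraction of points where clusters match classes, up to swapping the ids."""
    same = sum(c == t for c, t in zip(cluster_ids, true_labels)) / len(true_labels)
    return max(same, 1 - same)


def experiment(gap, rng, n=500):
    """Two unit-variance Gaussian classes whose means are `gap` apart."""
    xs = [rng.gauss(0, 1) for _ in range(n)] + [rng.gauss(gap, 1) for _ in range(n)]
    ys = [0] * n + [1] * n
    return agreement(two_means_1d(xs), ys)


rng = random.Random(0)
separated = experiment(gap=10, rng=rng)   # classes far apart
overlapping = experiment(gap=1, rng=rng)  # classes heavily overlapping
print(f"well separated: {separated:.2f}, overlapping: {overlapping:.2f}")
```

With the means ten standard deviations apart the clusters should recover the classes almost perfectly, while with the means one standard deviation apart a large share of points ends up in the "wrong" cluster, and any classifier trained on those cluster ids inherits that error.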

I also stumbled over the preceding sentence in the blog you quoted:

"An income prediction task can be regression if we output raw numbers, but if we quantize the income into different brackets and predict the bracket, it becomes a classification problem."

Again, it is theoretically possible, but not a recommended approach. By treating income prediction as a classification task we ignore (lose information about) the similarity between different "classes". The bracket [20,000 – 30,000] is closer to the bracket [30,000 – 40,000] than to [150,000 – 200,000]. Classification wouldn't take this into account. See my answer here for more details.
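A tiny sketch of what gets lost, using hypothetical bracket edges and incomes that I made up for illustration. It compares the standard zero-one classification loss with the absolute error you would see in a regression setting.

```python
import bisect

# Hypothetical bracket edges in currency units; the last bracket is open-ended.
EDGES = [20_000, 30_000, 40_000, 150_000, 200_000]


def bracket(income):
    """Index of the bracket an income falls into (0 = below the first edge)."""
    return bisect.bisect_right(EDGES, income)


def zero_one_loss(true_bracket, predicted_bracket):
    """Plain classification loss: every wrong bracket costs the same."""
    return int(true_bracket != predicted_bracket)


true_income = 25_000
near_miss, far_miss = 35_000, 175_000  # two hypothetical predictions

true_b = bracket(true_income)
loss_near = zero_one_loss(true_b, bracket(near_miss))
loss_far = zero_one_loss(true_b, bracket(far_miss))

err_near = abs(true_income - near_miss)
err_far = abs(true_income - far_miss)

print("classification loss:", loss_near, loss_far)
print("absolute error:     ", err_near, err_far)
```

As a classifier sees it, predicting the neighbouring bracket and predicting a bracket at the far end of the scale are equally wrong, whereas the absolute errors differ by more than an order of magnitude.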
