Solved – A method for propagating labels to unlabelled data

I have a large set of data and a small subset is labelled as being in class 'A' and the rest is unlabelled. I know that some of the unlabelled data should also be labelled 'A'. In order to label some more of the data my idea is to do the following:

  1. Build a classifier on the whole data set separating class 'A' from the unlabelled data.
  2. Run the classifier on the unlabelled data.
  3. Add the unlabelled items classified as being in class 'A' to class 'A'.
  4. Repeat.

Several parts of this are unclear or problematic, such as when to stop and how exactly to set the threshold for accepting an item as being in class 'A'.

Is a method like this known already in the literature so that I can gain some ideas for how to do it properly?

Learning from positive and unlabeled data is often referred to as PU learning. What you describe is a common approach to these kinds of problems, though I personally dislike such iterative approaches because they are highly sensitive to false positives (if you have any).
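For concreteness, the iterative scheme from the question could be sketched roughly as follows. This is only a minimal illustration and not the method from any particular paper. The hand-rolled logistic regression, the 0.9 acceptance threshold, the stopping rule (stop when a round adds nothing, or after `max_rounds`), and all function names including the `make_toy_data` helper are assumptions made for the example.

```python
import math
import random


def predict_proba(w, b, x):
    """Logistic model: estimated probability that x belongs to class 'A'."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent (illustrative only)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_b += err
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def iterative_pu_labelling(X, is_positive, threshold=0.9, max_rounds=10):
    """Repeatedly train 'A' vs. unlabelled and promote confident items to 'A'.

    threshold and max_rounds are assumed values, not recommendations.
    """
    labels = list(is_positive)
    for _ in range(max_rounds):
        # Step 1: treat every currently unlabelled item as a negative.
        y = [1.0 if p else 0.0 for p in labels]
        w, b = train_logistic(X, y)
        # Steps 2-3: score unlabelled items and accept the confident ones.
        newly_accepted = [
            i for i, xi in enumerate(X)
            if not labels[i] and predict_proba(w, b, xi) >= threshold
        ]
        if not newly_accepted:  # assumed stopping rule: nothing new this round
            break
        for i in newly_accepted:
            labels[i] = True
    return labels


def make_toy_data(seed=0, n_per_class=30, n_labelled=10):
    """Hypothetical toy data: two Gaussian blobs, few positives labelled."""
    rng = random.Random(seed)
    pos = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(n_per_class)]
    neg = [[rng.gauss(-2, 1), rng.gauss(-2, 1)] for _ in range(n_per_class)]
    X = pos + neg
    is_positive = [i < n_labelled for i in range(len(X))]
    return X, is_positive
```

Note that in this sketch an item is never removed from class 'A' once accepted, so a false positive admitted in an early round is fed back as a training positive in every later round. That is where the sensitivity to false positives comes from.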

You might want to check out two of my papers and references therein for an up-to-date overview on current research for these problems:

The first paper describes a state-of-the-art method to learn classifiers and the second is the only approach that allows you to estimate any performance metric based on contingency tables from test sets without known negatives (you read that right).

Both papers also provide a good overview of the existing literature on this subject.
