# Solved – Classification with partially “unknown” data

Suppose I want to learn a classifier that takes a vector of numbers as input, and gives a class label as output. My training data consists of a large number of input-output pairs.

However, when I come to test on new data, that data is typically only partially complete. For example, if the input vector is of length 100, only 30 of the elements might be given values, and the rest are "unknown".

As an example of this, consider image recognition where it is known that part of the image is occluded. Or consider classification in a general sense where it is known that part of the data is corrupt. In all cases, I know exactly which elements in the data vector are the unknown parts.

I'm wondering how I can learn a classifier that would work for this kind of data? I could just set the "unknown" elements to a random number, but given that there are often more unknown elements than known ones, this does not sound like a good solution. Or, I could randomly change elements in the training data to "unknown", and train with these rather than the complete data, but this might require exhaustive sampling of all combinations of known and unknown elements.

In particular, I am thinking about neural networks, but I am open to other classifiers.


I think there's a reasonable way to make it work with Neural Networks.

Let your value for unknown be 0. During training, pick an input and randomly set each of its values to 0 with probability $p$, where $p$ is the fraction of inputs you expect to be missing at test time. Note that the same input will have 0s at different positions on different iterations.
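As a rough sketch of that masking step, assuming NumPy and a toy batch of random data (the helper name `mask_inputs`, the batch shape, and the value `p = 0.7` are illustrative choices, not something from the question):

```python
import numpy as np

def mask_inputs(batch, p, rng):
    """Zero each element of `batch` independently with probability p.

    Returns the masked batch and the boolean keep-mask (True = kept).
    `mask_inputs` is a hypothetical helper written for this illustration.
    """
    keep = rng.random(batch.shape) >= p
    return batch * keep, keep

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))   # toy batch: 4 inputs of length 10 (assumed)
p = 0.7                        # assumed fraction of unknown elements at test time

# A fresh mask is drawn on every call, so the same input gets
# zeros at different positions on different training iterations.
X_masked_a, keep_a = mask_inputs(X, p, rng)
X_masked_b, keep_b = mask_inputs(X, p, rng)
```

In a training loop you would call something like this on each mini-batch before the forward pass. At test time you would skip the random mask and simply put 0 in the positions you know are unknown.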

I haven't seen this done before, but it is very similar to applying Dropout (a well-known regularization method for neural networks) to the input neurons instead of the hidden neurons. I don't think it's a good idea in general, but if you're forced to (as in your case), it is at least theoretically close to something that is known to work.
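One difference from Dropout as it is commonly implemented is worth a quick check. The "inverted" variant of Dropout rescales the surviving units by $1/(1-p)$, because nothing is dropped at test time. Here the test inputs are masked too, so plain zeroing without rescaling already matches what the network will see. A small sketch, assuming NumPy and an all-ones input chosen only to make the averages easy to read:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.7                         # assumed masking probability
x = np.ones(100_000)            # toy input of all ones (illustrative)
keep = rng.random(x.shape) >= p

plain = x * keep                 # zeroing only, as described above
inverted = x * keep / (1.0 - p)  # inverted-dropout style rescaling

print(plain.mean())     # close to 1 - p
print(inverted.mean())  # close to 1
```

If the fraction of missing values at test time differs a lot from the $p$ used in training, the overall scale of the input will differ as well, which is one reason to choose $p$ close to what you expect to see.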
