Say I have a model $y=f_n(x_1,x_2,x_3)$.

Here $y$ is a categorical, binary response, i.e. $y$ can only be 0 or 1. The data shows 87% 1 and 13% 0 values.

I fit a multinomial logit on a test dataset to try to predict $y$. Validation against a validation dataset showed 70% success. (Success means that the model prediction matches $y$ in the validation set.)

Normally I'd be happy with the 70% success rate. But I'm a bit confused: say I had a black-box model that always answered "$y=1$" no matter what $x_1$, $x_2$ and $x_3$ are. Wouldn't this model achieve 87% success anyway?

What's the point behind my multinomial logit unless I can beat 87%?

What gives? Am I making a blunder?


#### Best Answer

You are right that the performances of trivial models (always predicting 1, always predicting 0, or uniformly random guessing) are important benchmarks. However, there may be slightly different trivial benchmark models that better reflect the particular way the model was built.
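As a rough illustration of these benchmarks, the sketch below computes the accuracy of the three trivial models on a made-up label vector with the 87% / 13% split from the question. The function name `baseline_accuracies` and the list `y_valid` are hypothetical, and the 100 labels are invented for illustration only.

```python
import random

def baseline_accuracies(y_true, seed=0):
    """Accuracy of three trivial classifiers on binary (0/1) labels."""
    n = len(y_true)
    share_of_ones = sum(y_true) / n
    # Uniformly random guessing: simulated once with a fixed seed.
    rng = random.Random(seed)
    guesses = [rng.randint(0, 1) for _ in y_true]
    random_hits = sum(g == y for g, y in zip(guesses, y_true))
    return {
        "always_1": share_of_ones,       # correct whenever y == 1
        "always_0": 1 - share_of_ones,   # correct whenever y == 0
        "uniform_random": random_hits / n,
    }

# Hypothetical validation labels: 87 ones and 13 zeros (assumed, not real data).
y_valid = [1] * 87 + [0] * 13
print(baseline_accuracies(y_valid))
```

The random-guessing figure fluctuates around 50% from run to run, whereas the two constant predictors are fixed by the class proportions.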

You should follow Wayne's advice and break down the successes and failures for the two classes to find out what is going on.
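One way to do such a breakdown is sketched below. The predictions are entirely invented: they are chosen only so that the overall success rate is 70% on an 87% / 13% split, and they say nothing about how the actual model behaves. The helper `per_class_breakdown` is a hypothetical name.

```python
from collections import Counter

def per_class_breakdown(y_true, y_pred):
    """Return {label: (n_correct, n_total, fraction_correct)} per true class."""
    totals = Counter(y_true)
    correct = Counter(y for y, p in zip(y_true, y_pred) if y == p)
    return {
        label: (correct[label], totals[label], correct[label] / totals[label])
        for label in sorted(totals)
    }

# Invented example: 87 true ones followed by 13 true zeros.
y_valid = [1] * 87 + [0] * 13
# Invented predictions: 60 of the ones and 10 of the zeros are hit (70 in total).
y_pred = [1] * 60 + [0] * 27 + [0] * 10 + [1] * 3

for label, (hits, total, fraction) in per_class_breakdown(y_valid, y_pred).items():
    print(f"class {label}: {hits}/{total} correct ({fraction:.1%})")
```

With real predictions, a large gap between the two per-class fractions would show which class is responsible for the misses.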

However, you also need to be aware of the uncertainty of such performance estimates. Search here for "proportion confidence interval".
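To give a feel for that uncertainty, here is a minimal sketch of one common choice, the Wilson score interval for a binomial proportion. The question does not state the size of the validation set, so the values of `n` below (100 and 1000) are assumptions for illustration, as is the choice of the Wilson interval over other interval types.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# 70% observed success for two assumed validation-set sizes.
for successes, n in [(70, 100), (700, 1000)]:
    low, high = wilson_interval(successes, n)
    print(f"n={n}: {low:.3f} to {high:.3f}")
```

The interval narrows as `n` grows, so whether 70% is clearly below the 87% benchmark depends on how many validation cases there are.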

Here are some ideas about what else could have happened:

- Of course, your model may truly be worse than the trivial model.
- Some models train for equal prior probabilities, i.e. regardless of the relative frequencies in the training data, the model adjusts to be optimal for equal relative frequencies of the different classes. If that is the case, then from the model's point of view the trivial models would all have 50% correct predictions, and the model is better.
- Some models discard "duplicates" in the training input.
- You say that "the data" has 87% of cases of one class. Does that apply to the training data or to the test data? (Kind of the opposite situation from the point before)
- Some classification problems are ill-posed: typically, a "negative" class is defined only as not being the "defined" class (for example, having a particular disease vs. not having that disease, but possibly having any other disease). In that case, worse-than-trivial benchmark performance may occur if certain kinds of samples are not correctly classified.
- (probably not relevant here) Other kinds of adjustment to the relative frequencies (e.g. because one class is known to be difficult to recognize) can push performance either above or below the trivial benchmarks.
- (not relevant here) In multiclass classification, particular classes can be much more difficult to recognize than others, so worse than guessing/trivial prediction can happen.
- … many more possibilities…
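The equal-priors point in the list above can be made concrete with a small sketch. Averaging the per-class fractions of correct predictions (often called balanced accuracy) weights both classes equally, and under that view the always-1 model scores 50%. The function name and all data below are hypothetical; the model predictions are invented so that the plain success rate is 70%.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class fractions correct, i.e. accuracy under equal priors."""
    per_class = []
    for label in (0, 1):
        preds_for_class = [p for y, p in zip(y_true, y_pred) if y == label]
        per_class.append(preds_for_class.count(label) / len(preds_for_class))
    return sum(per_class) / len(per_class)

# Invented labels with the 87% / 13% split from the question.
y_valid = [1] * 87 + [0] * 13
always_one = [1] * 100
# Invented model output: 60 of 87 ones and 10 of 13 zeros predicted correctly.
model_pred = [1] * 60 + [0] * 27 + [0] * 10 + [1] * 3

print("always 1:", balanced_accuracy(y_valid, always_one))
print("invented model:", round(balanced_accuracy(y_valid, model_pred), 3))
```

In this made-up case the model loses to the always-1 benchmark on plain accuracy but beats it once both classes count equally, which is the situation the equal-priors bullet describes.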