Suppose we measure the classifier error on a test set and obtain a certain success rate – say, 75%. Now, of course, this is only one measurement – how can I estimate the "true" success rate? Surely it will be close to 75%, but how close?

I understand it's related to confidence intervals but now I'm lost in confidence intervals. I think my example is similar to this one on wikipedia where they look at weight distribution of margarine cups. (Sorry, math is not rendering here so I created a screenshot – you might also want to flick through the corresponding section in the wikipedia article).

I have the following questions:

- Why do they use the above standard error formula?
- Where does this $\Phi^{-1}(0.975) = 1.96$ come from?
- To solve my "true success rate" problem, should I repeat the estimation N times and then apply the same reasoning as they do with margarine cups?


#### Best Answer

Under the assumption that your data is normally distributed, the standard error can be used, as it is the spread expected from normally distributed data with the same expectation (mean).

We are interested in how many samples fall into the "tails" of the distribution – i.e. how many samples fall outside of a certain range. Here $\alpha$ is the significance level – i.e. if we set $\alpha = 0.05$ then this defines the boundaries within which 95% of the data should lie, in ideal circumstances. We then use the inverse CDF $\Phi^{-1}$ to calculate what these boundaries are. The corresponding upper-tail probability $Q(x) = 1 - \Phi(x)$ is also called the "Q-function", and can be expressed in terms of the error function as:

$Q(x) = \tfrac{1}{2} - \tfrac{1}{2}\operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right) = \tfrac{1}{2}\operatorname{erfc}\!\left(\frac{x}{\sqrt{2}}\right).$
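As a quick numerical check of the relationship above, here is a sketch using only the Python standard library (the answer below does the same thing with MATLAB's equivalent functions):

```python
import math
from statistics import NormalDist

def q(x):
    # Q(x) = (1/2) * erfc(x / sqrt(2)): upper-tail probability of a standard normal
    return 0.5 * math.erfc(x / math.sqrt(2))

# The 97.5% quantile of the standard normal is about 1.96...
z = NormalDist().inv_cdf(0.975)
print(round(z, 2))     # 1.96

# ...and plugging it back into the Q-function recovers the 2.5% tail mass
print(round(q(z), 3))  # 0.025
```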

This is available in MATLAB: `norminv(0.975)` returns 1.96 directly. In terms of the error function, the same value is given by `sqrt(2)*erfinv(0.95)`, or equivalently `sqrt(2)*erfcinv(0.05)`, since $Q(x) = 1 - \Phi(x)$.
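To connect this back to the original question: under the normal approximation, a 95% interval for the true success rate is $\hat p \pm 1.96\sqrt{\hat p(1-\hat p)/n}$. A minimal Python sketch, where the test-set size `n = 400` is made up for illustration (the question doesn't give one):

```python
import math
from statistics import NormalDist

p_hat = 0.75  # observed success rate
n = 400       # hypothetical test-set size (not given in the question)

z = NormalDist().inv_cdf(0.975)          # ~1.96 for a 95% interval
se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
lo, hi = p_hat - z * se, p_hat + z * se
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")   # 95% CI: [0.708, 0.792]
```

The interval shrinks like $1/\sqrt{n}$, so a larger test set pins down the "true" rate more tightly.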

- This is actually related to another question that I asked. The answer would be yes
*if* you expect the classification scores to be normally distributed. However, I'm not sure this is true – you might expect the scores to be biased towards 1 (if you're using accuracy) and almost certainly not symmetric (i.e. skewed). As noted in one of the answers to my question, something like McNemar's test might be useful, although that's really for comparing classifiers. I guess the best you can do for a single classifier is provide the mean and standard deviation of many train/test splits, as is common practice in research papers.
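The "mean and standard deviation over splits" approach is trivial to implement; a sketch with made-up accuracy values (your own would come from repeated train/test splits or cross-validation):

```python
from statistics import mean, stdev

# Hypothetical accuracies from 8 repeated train/test splits (made-up numbers)
accuracies = [0.74, 0.77, 0.73, 0.76, 0.75, 0.78, 0.72, 0.75]

# Report the figure the way papers usually do: mean +/- sample std dev
print(f"accuracy: {mean(accuracies):.3f} +/- {stdev(accuracies):.3f}")
# accuracy: 0.750 +/- 0.020
```

Note that `stdev` here is the sample standard deviation across splits; it describes the variability of the estimate without assuming the scores are normal.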