Quite a few Kaggle competitions have used or are using the Logarithmic Loss metric as the quality measure of a submission.

I'm wondering if there are other ways besides N-fold cross-validation to calculate confidence intervals for this metric. If model X has a log loss of 0.123456 on the test set and model Y has a log loss of 0.123457, I'm sure you'll agree that model X is not significantly better than model Y, unless we're talking about a gazillion data points.

Why something else than N-fold cross-validation? Simple answer: performance. For a certain application I need to know whether model X is significantly better than model Y (when looking at Log Loss). In other words, I need to know whether the Log Loss for model X falls outside the 95% confidence interval for the Log Loss of model Y.

I need to do this comparison many, many times with different models and datasets that are coming in every day. Performance is crucial, so doing 10-fold cross-validation a 1,000 times to get a rough estimate of the confidence intervals is not going to cut it, I'm afraid. The datasets for which I have to calculate the log loss are usually in the range of say 50 positives and 10,000 negatives to 20,000 positives and 1 million negatives.

What would you advise?

#### Best Answer

A reasonable way to estimate the confidence interval for the `log_loss` metric is to assume that model X is a *perfect* model, i.e. that X outputs the true class probabilities for every sample in the test set.

Consider the two-class case. Then the `log_loss` metric is an average of $N$ independent random variables, where $N$ is the size of the test set; the $j$-th variable takes the value $-\log(p_j)$ with probability $p_j$ and the value $-\log(1-p_j)$ with probability $1-p_j$.

It is theoretically possible to compute the resulting distribution of `log_loss` as a sum of random variables with known distributions analytically, but that is quite challenging. It is easier to get a numerical approximation via a Monte-Carlo procedure.

Here is a sketch of the algorithm:

1. Compute the predictions $p_j$ of model X for each sample in the test set.
2. Generate a fake class label for each sample: positive if $u < p_j$, where $u$ is drawn uniformly on $(0, 1)$.
3. Compute `log_loss` using the fake class labels.
4. Repeat steps 2 and 3 many times.
5. Compute the standard deviation of the `log_loss` estimates over the repetitions.
6. Estimate the confidence interval by multiplying this standard deviation by the constant corresponding to the desired confidence level.
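The steps above can be sketched in a few lines of NumPy; the function name and the synthetic predictions are my own illustration, not part of the original answer:

```python
import numpy as np

def log_loss_ci(p, n_rep=1000, z=1.96, seed=0):
    """Monte-Carlo half-width of the confidence interval for log loss,
    assuming the predictions p are the true class probabilities (step 1)."""
    rng = np.random.default_rng(seed)
    losses = np.empty(n_rep)
    for i in range(n_rep):
        # Step 2: fake labels -- positive with probability p_j.
        y = (rng.random(p.shape) < p).astype(float)
        # Step 3: log loss of the predictions against the fake labels.
        losses[i] = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Steps 4-6: repeat, take the std, scale for the confidence level.
    return z * losses.std()

# Illustrative usage: 10,000 samples with a ~5% average positive rate.
p = np.clip(np.random.default_rng(1).beta(2, 38, size=10_000), 1e-6, 1 - 1e-6)
half_width = log_loss_ci(p)
```

With $z = 1.96$ the returned half-width corresponds to a 95% confidence interval around the expected log loss.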

I have evaluated this approach on a proprietary click-through dataset by comparing the estimated standard deviation to the standard deviation obtained by splitting the test set into multiple independent subsets of equal size and computing the `log_loss` standard deviation across them.

I found that the standard deviation of `log_loss` depends on the number of samples $N$ in the test set in the expected way: $$ \mathrm{std} = \frac{a}{\sqrt{N}}, $$ where $a \approx 0.5$ and depends mildly on the average click-through rate.
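The split-the-test-set check can be sketched as follows; the synthetic probabilities and labels below stand in for the proprietary click-through data, and the variable names are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the proprietary data: true probabilities and
# labels drawn from them (roughly a 5% average click-through rate).
N = 200_000
p = np.clip(rng.beta(2, 38, size=N), 1e-6, 1 - 1e-6)
y = (rng.random(N) < p).astype(float)

# Split the test set into independent subsets of equal size and compute
# the log loss on each; the spread of these values estimates the std.
n_subsets = 100
losses = []
for chunk in np.array_split(rng.permutation(N), n_subsets):
    pc, yc = p[chunk], y[chunk]
    losses.append(-np.mean(yc * np.log(pc) + (1 - yc) * np.log(1 - pc)))

m = N // n_subsets                # samples per subset
a = np.std(losses) * np.sqrt(m)   # back out the constant in a / sqrt(m)
```

The recovered constant `a` depends on the label base rate, so on real click-through data it need not match the $a \approx 0.5$ quoted above exactly.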

From this, a very rough estimate (perhaps up to a factor of 2) of the 95% confidence interval for `log_loss` evaluated on 10 million samples is $\pm 0.0003$.

For the recent Kaggle CTR contest, the final standings look like this:

```
1  0.3791384
2  0.3803652
3  0.3806351
4  0.3810307
```

Assuming 10 million records in the test set, I believe that the first-place score is *significantly* better than the second, but the difference between second and third place may not be significant.
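This conclusion is just arithmetic on the $\mathrm{std} = a/\sqrt{N}$ estimate with $a \approx 0.5$:

```python
import math

# Rough 95% CI half-width for log loss on N = 10 million samples,
# using std ~ a / sqrt(N) with the empirical constant a ~ 0.5.
a, N = 0.5, 10_000_000
half_width = 1.96 * a / math.sqrt(N)

# The top of the CTR-contest leaderboard quoted above.
scores = [0.3791384, 0.3803652, 0.3806351, 0.3810307]
gap_1_2 = scores[1] - scores[0]   # about 0.0012: well outside the CI
gap_2_3 = scores[2] - scores[1]   # about 0.0003: within the CI
```

The first gap is roughly four times the half-width, while the second is slightly below it, which is what makes the first comparison significant and the second doubtful.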
