# Solved – Confidence intervals for the Log Loss metric for model comparison

Quite a few Kaggle competitions have used or are using the Logarithmic Loss metric as the quality measure of a submission.

I'm wondering if there are other ways besides N-fold cross-validation to calculate confidence intervals for this metric. If model X has a log loss of 0.123456 on the test set and model Y has a log loss of 0.123457, I'm sure you'll agree that model X is not significantly better than model Y, unless we're talking about a gazillion data points.

Why something else than N-fold cross-validation? Simple answer: performance. For a certain application I need to know whether model X is significantly better than model Y (when looking at Log Loss). In other words, I need to know whether the Log Loss for model X falls outside the 95% confidence interval for the Log Loss of model Y.

I need to do this comparison many, many times with different models and datasets that are coming in every day. Performance is crucial, so doing 10-fold cross-validation a 1,000 times to get a rough estimate of the confidence intervals is not going to cut it, I'm afraid. The datasets for which I have to calculate the log loss are usually in the range of say 50 positives and 10,000 negatives to 20,000 positives and 1 million negatives.

Contents

A reasonable way to estimate the confidence interval for `log_loss` metric is to assume that the model X is a perfect model, meaning that X gives true probabilities of output of classes for every sample in test set.

Consider two-class case. Then `log_loss` metric is an average of independent \$N\$ random variables, each one of which takes value \$log(p_j)\$ with probability \$p_j\$ and value \$log(1-p_j)\$ with probability \$(1-p_j)\$. Here \$N\$ is the size of test set.

It is theoretically possible to compute the resulting distribution of `log_loss` as a sum of random variables with known distributions analytically, but is quite challenging. It is easier to get a numerical approximation by Monte-Carlo procedure.

Here is a sketch of the algorithm:

1. Compute predictions \$p_j\$ of model X for each sample from test set.

2. Generate fake class labels for each sample \$p_j<u\$, where \$u\$ is uniform on \$(0..1)\$.

3. Compute `log_loss` using fake class labels

4. Repeat steps 2 and 3 many times.

5. Compute standard deviation of `log_loss` estimates over the repetitions.

6. Compute confidence interval estimation by multiplying standard deviation by a constant corresponding to desired confidence level.

I have evaluated this approach on a proprietary click through dataset by comparing estimated standard deviation to the standard deviation obtained by splitting test set into multiple independent subsets of equal size and computing `log_loss` standard deviation from them.

I have found that standard deviation of the `log_loss` depends on the number of samples in test set \$N\$ in an expected way: \$\$ std = frac{a}{sqrt N}, \$\$ where \$aapprox0.5\$ and depends mildly on the average click through rate.

From this a very rough estimate (maybe up to a factor of 2) for 95% confidence interval of `log_loss` evaluated on 10 million samples is \$pm 0.0003\$

For the recent Kaggle CTR contest the final standings looks like this: ` 1 0.3791384 2 0.3803652 3 0.3806351 4 0.3810307 `

Assuming 10 million records in the test set, I believe that first place score is significantly better than the second, but difference between the second and the third place may not be significant.

Rate this post