Quite a few Kaggle competitions have used or are using the Logarithmic Loss metric as the quality measure of a submission.
I'm wondering if there are other ways besides N-fold cross-validation to calculate confidence intervals for this metric. If model X has a log loss of 0.123456 on the test set and model Y has a log loss of 0.123457, I'm sure you'll agree that model X is not significantly better than model Y, unless we're talking about a gazillion data points.
Why something other than N-fold cross-validation? Simple answer: performance. For a certain application I need to know whether model X is significantly better than model Y (when looking at log loss). In other words, I need to know whether the log loss for model X falls outside the 95% confidence interval for the log loss of model Y.
I need to do this comparison many, many times with different models and datasets that are coming in every day. Performance is crucial, so doing 10-fold cross-validation 1,000 times to get a rough estimate of the confidence intervals is not going to cut it, I'm afraid. The datasets for which I have to calculate the log loss usually range from roughly 50 positives and 10,000 negatives up to 20,000 positives and 1 million negatives.
What would you advise?
Best Answer
A reasonable way to estimate a confidence interval for the log_loss metric is to assume that model X is a perfect model, meaning that X gives the true class probabilities for every sample in the test set.
Consider the two-class case. Then the log_loss metric is an average of $N$ independent random variables, each of which takes the value $-\log(p_j)$ with probability $p_j$ and the value $-\log(1-p_j)$ with probability $1-p_j$. Here $N$ is the size of the test set and $p_j$ is the predicted probability of the positive class for sample $j$.
It is theoretically possible to compute the resulting distribution of log_loss analytically, as a sum of random variables with known distributions, but this is quite challenging. It is easier to get a numerical approximation with a Monte Carlo procedure.
Here is a sketch of the algorithm (a minimal code sketch follows the list):
1. Compute the predictions $p_j$ of model X for each sample in the test set.
2. Generate a fake class label for each sample: draw $u$ uniformly on $(0,1)$ and assign label 1 if $u < p_j$, label 0 otherwise.
3. Compute log_loss using the fake class labels.
4. Repeat steps 2 and 3 many times.
5. Compute the standard deviation of the log_loss estimates over the repetitions.
6. Compute the confidence interval by multiplying the standard deviation by the constant corresponding to the desired confidence level (e.g. 1.96 for 95%).
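A minimal Python sketch of this procedure, assuming `p` is an array of the model's predicted positive-class probabilities on the test set; the number of repetitions, the 95% z-value, and the example data are illustrative choices, not part of the original answer:

```python
import numpy as np
from sklearn.metrics import log_loss

def log_loss_confidence_interval(p, n_repeats=1000, z=1.96, seed=0):
    """Monte Carlo half-width of the log_loss confidence interval, assuming
    the predicted probabilities p are the true positive-class probabilities."""
    rng = np.random.default_rng(seed)
    losses = np.empty(n_repeats)
    for i in range(n_repeats):
        # Step 2: fake labels drawn from the model's own probabilities.
        fake_labels = (rng.random(len(p)) < p).astype(int)
        # Step 3: log loss of the predictions against the fake labels.
        losses[i] = log_loss(fake_labels, p, labels=[0, 1])
    # Steps 5-6: std over repetitions times the z-value for the desired level.
    return z * losses.std()

# Illustrative usage: 10,000 samples with a low average positive rate.
p = np.clip(np.random.default_rng(1).beta(1, 20, size=10_000), 1e-6, 1 - 1e-6)
print(f"95% CI half-width: +/-{log_loss_confidence_interval(p):.5f}")
```

The returned half-width can then be compared against the observed log_loss difference between two models.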
I have evaluated this approach on a proprietary click-through dataset by comparing the estimated standard deviation to the standard deviation obtained by splitting the test set into multiple independent subsets of equal size and computing the log_loss standard deviation across them.
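A sketch of that empirical check, under the assumption that `y_true` and `p` hold the real labels and predicted probabilities; the number of subsets is an arbitrary choice:

```python
import numpy as np
from sklearn.metrics import log_loss

def empirical_log_loss_std(y_true, p, n_subsets=20, seed=0):
    """Std of log_loss across equal-sized random subsets, rescaled to the
    full test set size (the std scales roughly as 1/sqrt(N))."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_true))
    subset_losses = [
        log_loss(y_true[chunk], p[chunk], labels=[0, 1])
        for chunk in np.array_split(idx, n_subsets)
    ]
    # Each subset has ~N/n_subsets samples, so divide by sqrt(n_subsets)
    # to get the std at the full test set size N.
    return np.std(subset_losses) / np.sqrt(n_subsets)
```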
I have found that the standard deviation of the log_loss depends on the number of samples in the test set $N$ in the expected way: $$ \mathrm{std} = \frac{a}{\sqrt{N}}, $$ where $a \approx 0.5$ and depends mildly on the average click-through rate.
From this, a very rough estimate (maybe up to a factor of 2) of the 95% confidence interval for log_loss evaluated on 10 million samples is $\pm 0.0003$.
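As a sanity check on the arithmetic: $1.96 \cdot 0.5 / \sqrt{10^7} \approx 3.1 \times 10^{-4}$, consistent with the quoted $\pm 0.0003$.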
For the recent Kaggle CTR contest, the final standings look like this:

1. 0.3791384
2. 0.3803652
3. 0.3806351
4. 0.3810307
Assuming 10 million records in the test set, I believe that the first-place score is significantly better than the second, but the difference between the second and third place may not be significant.