Solved – How to choose the best metric to measure the calibration

I program and do test-driven development. After I make a change to my code, I run my tests. Sometimes they succeed and sometimes they fail. Before I run a test, I write down a number from 0.01 to 0.99 for my credence that the test will succeed.

I want to know whether I'm improving in predicting whether my test will succeed or fail. It would also be nice if I can track whether I'm better at predicting whether the test will succeed on Mondays or on Fridays. If my ability to predict test success correlates with other metrics I track, I want to know.

That leaves me with the task of choosing the right metric.
In Superforecasting, Philip Tetlock proposes using the Brier score to measure how well experts are calibrated. Another metric that has been proposed in the literature is the logarithmic scoring rule. There are also other possible candidates.
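For reference, both rules can be computed from the same log of predictions and outcomes. The following is a minimal sketch in plain Python; the function names and the sample numbers are made up for illustration and are not taken from any particular library.

```python
import math


def brier_score(predictions, outcomes):
    """Mean squared difference between the stated probability and the 0/1 outcome.

    Lower is better; 0 would be a perfect score.
    """
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)


def log_score(predictions, outcomes):
    """Mean negative log of the probability assigned to what actually happened.

    Lower is better. Credences are assumed to lie strictly between 0 and 1
    (e.g. 0.01 to 0.99), so the logarithm is always finite.
    """
    total = sum(math.log(p if o == 1 else 1 - p) for p, o in zip(predictions, outcomes))
    return -total / len(predictions)


# Hypothetical log: credence that the test passes, and whether it did (1) or not (0).
predictions = [0.9, 0.7, 0.6, 0.99]
outcomes = [1, 1, 0, 0]

brier = brier_score(predictions, outcomes)
logarithmic = log_score(predictions, outcomes)
print(f"Brier score: {brier:.4f}")
print(f"Log score:   {logarithmic:.4f}")
```

The last entry (0.99 credence, test failed) shows one practical difference: the squared error of a single prediction is capped at 1, while the log penalty grows without bound as a confident prediction turns out wrong.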

How do I decide which metric to use? Is there an argument for favoring one scoring rule over the others?

I assume that you are doing unit-tests for your code.

One idea I can think of, which may not do exactly what you want, is to use a linear model.

The benefit of doing that is that you can create a bunch of other variables and include them in the analysis.

Let's say that you have a vector $\mathbf{Y}$ which includes the outcome of your tests, and another vector $\mathbf{x}$ that includes your predictions of the outcome.

Now you can simply fit the linear model

$$ y_i = a + bx_i + \epsilon $$

and find the value of $b$. A higher value of $b$ would indicate that your predictions are becoming better.
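As a rough sketch of the fit, the intercept and slope can be computed with the closed-form least-squares formulas in plain Python. The function name `fit_line` and the sample data are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Spread of the predictions around their mean.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all predictions are identical; the slope is undefined")
    # Co-movement of predictions and outcomes.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical log: predicted probability of success, and outcome (1 = passed).
x = [0.9, 0.8, 0.3, 0.6, 0.2, 0.7]
y = [1, 1, 0, 1, 0, 0]

a, b = fit_line(x, y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

In this model, a perfectly calibrated forecaster would have $a$ near 0 and $b$ near 1, so it may be worth looking at both coefficients rather than at the slope alone.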

The nice thing about this approach is that you can now add other variables to see whether they give a better model, and those variables can help in making better predictions. One such variable could be an indicator for the day of the week, e.g. one that is 1 on Mondays and 0 on all other days. If you include that variable in the model, you would get:

$$ y_i = a + a_{\text{Monday}} + bx_i + \epsilon $$

And if the coefficient $a_{\text{Monday}}$ is significant and positive, then it could mean that you are more conservative in your predictions on Mondays.
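The model with a Monday term can be sketched as a least-squares fit with three columns: a constant, a 0/1 Monday indicator, and the prediction. The helper names (`solve_linear_system`, `fit_least_squares`, `is_monday`) and the data below are hypothetical. The sketch only estimates the coefficients; judging whether $a_{\text{Monday}}$ is significant would need standard errors, which it does not compute.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * coeffs = rhs by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (the columns of the design matrix are
    not redundant).
    """
    n = len(rhs)
    # Augmented matrix, so row operations act on both sides at once.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (aug[row][n] - known) / aug[row][row]
    return coeffs


def fit_least_squares(rows, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical log: prediction, whether the day was a Monday, and the outcome.
x = [0.9, 0.8, 0.3, 0.6, 0.2, 0.7, 0.5, 0.4]
is_monday = [1, 0, 1, 0, 0, 1, 0, 1]
y = [1, 1, 1, 1, 0, 1, 0, 0]

# Each row is [constant, Monday indicator, prediction].
rows = [[1.0, m, xi] for m, xi in zip(is_monday, x)]
a, a_monday, b = fit_least_squares(rows, y)
print(f"a = {a:.3f}, a_Monday = {a_monday:.3f}, b = {b:.3f}")
```

Because the indicator column is 0 on every other day, the $a_{\text{Monday}}$ term only shifts the fitted value for Monday observations.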

You could also create a new variable where you give a score to assess the difficulty of the task you performed. If you have version control, then you could e.g. use the number of lines of code as difficulty, i.e. the more code you write, the more likely something will break.

Other variables could be the number of cups of coffee you had that day, or an indicator for upcoming deadlines (meaning there is more stress to finish things), etc.

You can also use a time variable to see if your predictions are getting better. Further candidates are how long you spent on the task, how many sessions you have spent on it, whether you were doing a quick fix that might be sloppy, etc.

In the end you have a prediction model with which you can try to predict the likelihood of success. If you manage to create this, then maybe you do not even have to make your own predictions: you can just use all the variables and get a pretty good guess at whether things will work.

The thing is that you only wanted a single number. In that case you can use the simple model I presented at the beginning and just use the slope. Redo the calculation for each period, and then look at whether there is a trend in that score over time.
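The per-period idea could look roughly like this: group the log by period, compute the slope of the simple model within each period, and then check how those slopes move over time. The period labels, the `slope` helper, and the data are all made up for illustration, and using the slope of the slopes as a trend measure is just one simple choice.

```python
from collections import defaultdict


def slope(x, y):
    """Least-squares slope of y on x (the b in y = a + b*x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are identical; the slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical log entries: (period, predicted probability, outcome).
log = [
    ("week1", 0.9, 0), ("week1", 0.3, 1), ("week1", 0.6, 1), ("week1", 0.5, 0),
    ("week2", 0.8, 1), ("week2", 0.4, 1), ("week2", 0.7, 0), ("week2", 0.3, 0),
    ("week3", 0.9, 1), ("week3", 0.2, 0), ("week3", 0.8, 1), ("week3", 0.3, 0),
]

# Group predictions and outcomes by period, keeping the order of first appearance.
by_period = defaultdict(lambda: ([], []))
for period, prediction, outcome in log:
    by_period[period][0].append(prediction)
    by_period[period][1].append(outcome)

# One slope per period.
slopes = {period: slope(xs, ys) for period, (xs, ys) in by_period.items()}

# Trend: regress the per-period slopes on the period index 0, 1, 2, ...
trend = slope(list(range(len(slopes))), list(slopes.values()))

for period, value in slopes.items():
    print(f"{period}: slope = {value:.3f}")
print(f"trend in slope per period = {trend:.3f}")
```

With only a handful of tests per period, each slope will be noisy, so longer periods or more data per period would probably give a more stable picture.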

Hope this helps.
