I'm having a hard time understanding some of the inner workings of YOLO, especially the loss function depicted in this seminal paper. Bear in mind that I'm nowhere close to being a specialist in deep learning and computer vision (in fact, I started studying the subject two weeks ago, so I do have some gaps).

So what I've understood so far: we want to build *a neural network that, given an image, predicts bounding boxes and class probabilities for those bounding boxes.* Therefore, given some input image, our network will output a tensor of size $S \times S \times (B \times 5 + C)$ (see the paper mentioned above).
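For concreteness, with the paper's PASCAL VOC setup ($S = 7$, $B = 2$, $C = 20$) that output is a $7 \times 7 \times 30$ tensor. A minimal sketch of the shape (the variable names are illustrative, not from the paper):

```python
import numpy as np

# Grid size, boxes per cell, number of classes (PASCAL VOC values from the paper)
S, B, C = 7, 2, 20

# Dummy network output: for each of the S*S grid cells,
# B boxes with (x, y, w, h, confidence) plus C class probabilities.
output = np.zeros((S, S, B * 5 + C))
print(output.shape)  # (7, 7, 30)
```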

In order to do that, we need to train our network on some dataset (e.g. *Pascal VOC*), backpropagating our *loss measurement* through the whole network to optimize our weights via gradient descent (in a nutshell: the usual stuff!).

The loss function given by Redmon et al. is the following :

$$
\begin{align*}
&\color{blue}{\lambda_\textbf{coord}
\sum_{i = 0}^{S^2}
\sum_{j = 0}^{B}
\mathbb{1}_{ij}^{\text{obj}}
\left[
\left( x_i - \hat{x}_i \right)^2 +
\left( y_i - \hat{y}_i \right)^2
\right]} \\
&\color{blue}{+ \lambda_\textbf{coord}
\sum_{i = 0}^{S^2}
\sum_{j = 0}^{B}
\mathbb{1}_{ij}^{\text{obj}}
\left[
\left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 +
\left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2
\right]} \\
&\color{green}{+ \sum_{i = 0}^{S^2}
\sum_{j = 0}^{B}
\mathbb{1}_{ij}^{\text{obj}}
\left( C_i - \hat{C}_i \right)^2} \\
&\color{green}{+ \lambda_\textrm{noobj}
\sum_{i = 0}^{S^2}
\sum_{j = 0}^{B}
\mathbb{1}_{ij}^{\text{noobj}}
\left( C_i - \hat{C}_i \right)^2} \\
&\color{red}{+ \sum_{i = 0}^{S^2}
\mathbb{1}_{i}^{\text{obj}}
\sum_{c \in \textrm{classes}}
\left( p_i(c) - \hat{p}_i(c) \right)^2}
\end{align*}
$$
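To make the structure of this loss concrete, here is a rough sketch of the per-cell computation, simplified to $B = 1$ so the "responsible box" selection is skipped; the function name, the flat `[x, y, w, h, conf, classes...]` layout, and the $B = 1$ simplification are my own, not from the paper:

```python
import numpy as np

def yolo_loss_single_cell(pred, target, obj_present,
                          lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of the YOLO loss for ONE grid cell with ONE box (B = 1).

    pred, target: arrays laid out as [x, y, w, h, confidence, class probs...].
    obj_present: 1 if an object's center falls in this cell, else 0
    (i.e. the indicator function 1_obj in the equation above).
    """
    x, y, w, h, conf = pred[:5]
    tx, ty, tw, th, tconf = target[:5]
    if obj_present:
        # Blue terms: coordinate loss, with sqrt on width/height
        coord = lambda_coord * ((x - tx) ** 2 + (y - ty) ** 2
                                + (np.sqrt(w) - np.sqrt(tw)) ** 2
                                + (np.sqrt(h) - np.sqrt(th)) ** 2)
        # Green term (obj): confidence loss
        conf_loss = (conf - tconf) ** 2
        # Red term: classification loss
        class_loss = np.sum((pred[5:] - target[5:]) ** 2)
        return coord + conf_loss + class_loss
    # Green term (noobj): only the confidence is penalized, down-weighted
    return lambda_noobj * (conf - tconf) ** 2
```

A perfect prediction gives zero loss, and a cell with no object contributes only the down-weighted confidence term.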

I've highlighted in blue the first part of our loss function, which we could call the **coordinate loss**. I totally understand this part, as it is quite intuitive and natural.

My problem is that I'm not sure I quite understand the last terms of this function. Here are some of my guesses:

*The red part*: since our neural network should output class probabilities $p(c_i \mid \text{object})$ for bounding boxes, those $(p(c_i \mid \text{object}))_{i = 1, \dots, C}$ should indeed be considered as parameters, and therefore it makes sense that they appear in the loss function. So my question would be: **since we are training on some labeled dataset, it means that $p_i(c)$ should be zero except for one class $c$, right? (it is deterministic!)**

*The green part*: I'm a little bit lost on that one. Since $\widehat{C_i}$ is defined as the confidence of grid cell $i$, that is $\widehat{C_i} = \Pr(\text{object}) \times IOU_{pred}^{truth}$, and $C_i$ is defined to be the 'real' confidence score (that is, $IOU_{pred}^{truth}$), it seems silly: everything is deterministic (we're on the training set), so we know whether or not there is an object in our grid cell, and therefore $\Pr(\text{object}) \in \{0,1\}$... I don't really know how to explain it, but I'm confused by this loss function, especially by the real meaning of the *confidence score* in the model.

Thanks a lot for your help!

EDIT: maybe a first question would be: why are we interested in the confidence score? At the end of the neural net, do we have some decision algorithm that says: if this bounding box has confidence above threshold $c_0$, then display it and choose the class with the highest probability?

#### Best Answer

Basically, YOLO combines detection and classification into one loss function: the green part corresponds to *whether or not any object is there*, while the red part encourages correctly determining *which* object is there, if one is present.

> Since we are training on some labeled dataset, it means that $p_i(c)$ should be zero except for one class $c$, right?

Yes. Notice we only penalize the network when there is indeed an object present. And if your question is whether $p_i(c) \in \{0,1\}$, then usually yes, that is how it is done.

> Why are we interested in the confidence score? At the end of the neural net, do we have some decision algorithm that says: if this bounding box has confidence above threshold $c_0$, then display it and choose the class with the highest probability?

Usually, yes, a threshold is needed exactly as you describe. Often it is a hyper-parameter that can be hand-chosen or cross-validated over.
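A minimal sketch of that decision rule, with a hypothetical threshold `c0` and toy detections (all values here are made up for illustration):

```python
import numpy as np

# Hypothetical confidence threshold (a tunable hyper-parameter)
c0 = 0.5

# Toy detections: (confidence, class probabilities) per predicted box
detections = [
    (0.9, np.array([0.1, 0.8, 0.1])),  # kept: confidence above c0
    (0.3, np.array([0.6, 0.2, 0.2])),  # discarded: confidence below c0
]

# Keep boxes above threshold and pick the most probable class for each
kept = [(conf, int(np.argmax(probs)))
        for conf, probs in detections if conf > c0]
print(kept)  # [(0.9, 1)]
```

(In practice this filtering is typically followed by non-maximum suppression to remove duplicate boxes.)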

As for your other questions about the "confidence" score, I must agree that the nomenclature is confusing. There are two "viewpoints" one can have about this: (1) a probabilistic confidence measure of whether *any* object exists in the locale, and (2) a deterministic prediction of the overlap between the local predicted bounding box $\hat{B}$ and the ground truth one $B$. Both outlooks are often conflated, and in some sense can be treated as "equivalent", since we can view $|B \cap \hat{B}| \, / \, |B \cup \hat{B}| \in [0,1]$ as a probability.
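For reference, the $|B \cap \hat{B}| / |B \cup \hat{B}|$ quantity can be computed directly; this sketch assumes corner-coordinate boxes `(x1, y1, x2, y2)` (YOLO itself predicts center/width/height, so a conversion would be needed first):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero: non-overlapping boxes have empty intersection
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

The result always lies in $[0,1]$, which is what makes the probabilistic reading possible.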

As an aside, there are already a couple of other discussions of the YOLO loss: