I haven't even finished one epoch, so I don't think it could be any sort of overfitting. I am training on a very large amount of data (27 GB of text), so it'll still be a while before I even reach one epoch.

The loss has now been increasing for twice as long as it had been decreasing, although the loss is still smaller overall since the increase is at a slower rate.

If it helps, my architecture is BERT, with 2 extra fully connected layers after BERT. I am using triplet loss via softmax/cross entropy.


#### Best Answer

Triplet models are notoriously tricky to train. Before starting a triplet loss project, I strongly recommend reading "FaceNet: A Unified Embedding for Face Recognition and Clustering" by Florian Schroff, Dmitry Kalenichenko, and James Philbin, because it outlines some of the key problems that arise when using triplet losses, as well as suggested remediations. In my experience, their tips and tricks provide enormous improvements to model training, both in terms of performance against a test set and in terms of wall-time consumed to train the model. In summary, the authors make several suggestions, but first we need to motivate them.

Let's start by defining the problem. The goal of triplet loss is to find an embedding such that $$ \left\|f(x^a_i) - f(x^p_i)\right\|_2^2 + \alpha < \left\|f(x_i^a) - f(x_i^n)\right\|_2^2 \quad \forall \left(f(x_i^a), f(x_i^p), f(x_i^n)\right) \in \mathcal{T} \tag{*}$$ where $\mathcal{T}$ is the set of all possible triplets. A triplet is composed of an anchor point, a positive point (same class as the anchor), and a negative point (distinct class from the anchor).

Clearly, iterating over *all possible triplets* becomes enormously expensive when the data set is even moderately sized.

The loss is zero when the inequality $(*)$ holds, and becomes larger the more that this inequality is violated, giving us the loss function

$$\begin{aligned} L &= \sum_i \max\left\{0, \left\|f(x^a_i) - f(x^p_i)\right\|_2^2 - \left\|f(x_i^a) - f(x_i^n)\right\|_2^2 + \alpha\right\} \\ &= \sum_i \text{ReLU}\left(\left\|f(x^a_i) - f(x^p_i)\right\|_2^2 - \left\|f(x_i^a) - f(x_i^n)\right\|_2^2 + \alpha\right). \end{aligned}$$
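To make this concrete, here's a minimal NumPy sketch of the hinge-style loss above (the function name and batch layout are my own choices, not anything from the question):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge-style triplet loss summed over a batch.

    Each argument is an (n, d) array of embedding vectors;
    alpha is the margin from the FaceNet formulation.
    """
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)  # ||f(a) - f(p)||^2
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)  # ||f(a) - f(n)||^2
    # max{0, pos - neg + alpha} is exactly ReLU(pos - neg + alpha)
    return np.sum(np.maximum(0.0, pos_dist - neg_dist + alpha))
```

Note that when the negative is much farther from the anchor than the positive, the hinge clamps the term to zero, so that triplet contributes nothing to the gradient.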

# My hypothesis about your observed behavior.

My understanding is that you're composing triplets by selecting points at random. After even a little training, it's usually the case that the model arranges the classes well enough that the loss for a randomly-selected triplet is typically small or even zero (but not for all triplets). Counter-intuitively, this isn't helpful, because if the training losses are zero, there's no information available to adjust the weights. Instead, we want to focus on the triplets with the most information; these are the so-called hard triplets. This explains why the loss initially decreases, as well as why you observe large swings in the loss value: most triplets become easy after a little training, but some triplets are hard.

Additionally, I believe you're seeing large swings in the loss value because the minibatch size is small.

This brings us to the first tip from the paper.

# Focus on the hardest triplets.

Instead of composing a triplet at random, use *online hard-negative mining* to choose the triplets with the highest loss.

We want to search for these hard triplets *online* because which triplets are hard depends on their embeddings, which depend on the model parameters. In other words, the set of triplets labeled "hard" will probably change as the model trains.

So, within a batch, compare all of the distances and construct the triplets where the anchor-negative distance $\left\|f(x_i^a) - f(x_i^n)\right\|_2^2$ is the *smallest*. This is *online mining* because you're computing the batch and then picking which triplets to compare. It's *hard-negative mining* because you're choosing the smallest anchor-negative distance. (By contrast, batch-hard mining chooses the hardest negative *and* the hardest positive. The hardest positive has the *largest* $\left\|f(x^a_i) - f(x^p_i)\right\|_2^2$. Batch-hard mining is an even harder task because both the positives and negatives are hardest.)
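A sketch of the within-batch comparison might look like this, assuming NumPy arrays of embeddings and integer class labels (the helper name is my own):

```python
import numpy as np

def hardest_negatives(embeddings, labels):
    """For each anchor in the batch, return the index of the *closest*
    point with a different label (online hard-negative mining)."""
    # Pairwise squared Euclidean distances, shape (n, n), via broadcasting.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)
    # Mask out same-label pairs (including self) so only negatives remain.
    same = labels[:, None] == labels[None, :]
    dists = np.where(same, np.inf, dists)
    return np.argmin(dists, axis=1)  # hardest negative per anchor
```

The key point is that the selection happens *after* the batch is embedded, so the notion of "hard" tracks the current model parameters.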

By construction, we know that the loss for all non-hard triplets must be smaller, because hard triplets are characterized by having the largest losses. This means that the loss values produced by hard mining will tend to be larger than those from other methods of choosing triplets.

This brings us to the second suggestion.

# Use large batch sizes.

Because online hard negative mining looks for the largest losses amongst all possible triplets in a batch, using a large batch is helpful because the value of those maxima is larger in expectation. This is an obvious result of order statistics: appending more draws to a sample will produce a maximum that's *at least* as large. The FaceNet paper uses batch sizes of 1000. Increasing the batch size increases the difficulty of the task.
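The order-statistics point is easy to demonstrate numerically with toy per-triplet losses (random numbers here, not real model output):

```python
import numpy as np

# Toy per-triplet losses for one large batch; any distribution works here.
rng = np.random.default_rng(0)
losses = rng.random(2000)

# The hardest loss found in a superset is at least as large as the hardest
# loss found in any subset, so larger batches can only raise the mined
# maximum, and raise it in expectation.
small_max = losses[:500].max()
large_max = losses.max()
```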

As additional justification for large batch sizes, consider that we would ideally make all triplet comparisons to find the hardest triplets at each step of computing the loss. However, because $|\mathcal{T}|$ is large, this is typically infeasible. So instead, we look for the hard samples inside each mini-batch, for some large mini-batch size. This will tend to yield easier triplets than the hardest triplets in the entire data set, but it is a necessary compromise to make training on large datasets feasible.

This brings us to the third suggestion.

# Start with *semi-hard* negative mining.

If we start training the model with online hard negative mining, the loss tends to just get stuck at a high value and not decrease. If we first train with semi-hard negative mining, and then switch to online hard negative mining, the model tends to do better.

Semi-hard negative mining has the same goal as $(*)$, but instead of focusing on *all* triplets in $\mathcal{T}$, it only looks at the triplets that *already satisfy a specific ordering*: $$ \left\|f(x^a_i) - f(x^p_i)\right\|_2^2 < \left\|f(x^a_i) - f(x^n_i)\right\|_2^2 < \alpha, $$ and then chooses the *hardest negative* that satisfies this criterion. The semi-hard loss tends to decrease quickly to very small values because the underlying task is easier: the points are already ordered correctly, and any points which aren't ordered that way are ignored.
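A sketch of the selection rule, implementing the ordering exactly as written above (one anchor at a time, candidate negatives as rows; the function name is mine):

```python
import numpy as np

def semi_hard_negative(anchor, positive, candidates, alpha=0.2):
    """Among candidate negatives, pick the hardest one that still satisfies
    the semi-hard ordering  d(a,p) < d(a,n) < alpha.

    Returns the index of the chosen negative, or None if no candidate
    satisfies the criterion.
    """
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dists = np.sum((candidates - anchor) ** 2, axis=1)
    ok = (neg_dists > pos_dist) & (neg_dists < alpha)
    if not np.any(ok):
        return None
    # "Hardest" among the valid candidates = smallest anchor-negative distance.
    valid = np.where(ok, neg_dists, np.inf)
    return int(np.argmin(valid))
```

Anchors with no valid semi-hard negative are simply skipped, which is exactly the "ignore points that aren't ordered that way" behavior described above.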

I think of this as a certain kind of supervised pre-training of the model: sort the negatives that are within the margin of the anchors so that the online batch hard loss task has a good starting point.

# Look out for a collapsed model

Triplet models are susceptible to mapping each input to the same point. When this happens, the distances in $(*)$ go to zero, the loss gets stuck at $\alpha$, and the model is basically done updating. Semi-hard negative mining can also help prevent this from happening.

In my experience, the loss tending towards $\alpha$ is a clear signal that training isn't working as desired and the embeddings are not informative. You can check whether this is the case by examining the embedding vectors: if embeddings from different classes all tend to cluster near the same point, there's a problem.
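One crude way to automate that check (my own heuristic, with an arbitrary tolerance):

```python
import numpy as np

def collapse_check(embeddings, tol=1e-3):
    """Crude collapse diagnostic: if every embedding sits near the batch
    mean, the model has mapped all inputs to (almost) the same point and
    the triplet loss will be stuck near alpha."""
    spread = np.sum((embeddings - embeddings.mean(axis=0)) ** 2, axis=1)
    return bool(np.max(spread) < tol)
```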

# I'm not sure you want to softmax your embeddings.

The FaceNet authors project their outputs to the unit sphere, i.e. the embedding vectors are constrained to unit length. This is because if we allow the embedding vectors to have any length, then the simple fact that data in high dimensions is spread out makes it easy to satisfy the desired inequality $(*)$.

Choosing a unit sphere projection implies that the largest distance between two points must be twice the radius, i.e. 2. The choice of $alpha$ is likewise strongly linked to this spherical projection. The FaceNet authors don't write about how they chose $alpha=0.2$ at all, but my guess is they experimented and found this value yielded nice results. 🤷

Choosing softmax for your embeddings means that the embeddings have unit $L^1$ norm instead of unit $L^2$ norm, and each element is non-negative. This is a much stronger restriction than projecting to a sphere, and I wonder whether it will produce the desired result. Likewise, it might mean that you need to be careful about choosing $\alpha$, since the largest possible distance between embeddings is different.
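To see why the geometry differs, compare the two projections directly. Antipodal points on the unit sphere are Euclidean distance 2 apart, while two corners of the softmax simplex (one-hot vectors) are only $\sqrt{2}$ apart, so the same $\alpha$ means something different in each geometry:

```python
import numpy as np

def l2_normalize(x):
    """FaceNet-style projection to the unit sphere (||v||_2 = 1)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x):
    """Softmax projection to the simplex (||v||_1 = 1, entries >= 0)."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

# Extremes of each geometry: antipodal points on the sphere vs. two
# one-hot corners of the simplex.
sphere_max = np.linalg.norm(np.array([1.0, 0.0]) - np.array([-1.0, 0.0]))
simplex_max = np.linalg.norm(np.array([1.0, 0.0]) - np.array([0.0, 1.0]))
```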

# Putting it all together

First, train with semi-hard negative mining, then with online hard negative mining. I've found modest gains from further training with online batch-hard mining, but usually this improvement is entirely realized in the first epoch of batch-hard mining, and the second and later epochs are basically flat. You can also increase the difficulty of the task by increasing the batch size: for example, start with sizes of 500, then increase to 1000 and then 2000 after some number of epochs. This might help to eke out larger gains.
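The curriculum might be organized as a simple schedule like the following. The epoch splits and batch sizes here are illustrative choices on my part, not values from the FaceNet paper:

```python
def mining_schedule(total_epochs):
    """Sketch of the curriculum above: semi-hard first, then online
    hard-negative mining, then batch-hard mining with growing batches.

    Returns (epoch, mining_strategy, batch_size) tuples.
    """
    plan = []
    for epoch in range(total_epochs):
        if epoch < total_epochs // 3:
            plan.append((epoch, "semi-hard", 500))
        elif epoch < 2 * total_epochs // 3:
            plan.append((epoch, "hard-negative", 1000))
        else:
            plan.append((epoch, "batch-hard", 2000))
    return plan
```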

# Track the hardest loss throughout

Changing the losses changes the tasks, so comparing the value of semi-hard loss to batch hard loss is like comparing apples to oranges. Because of how semi-hard loss is defined, its value will always be smaller than ordinary triplet loss. But we still want to achieve the inequality $(*)$! To make a consistent comparison as training progresses, you should measure the loss on the hardest task throughout training to confirm that the model is, indeed, improving as you change tasks during training.
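A sketch of that yardstick: evaluate the batch-hard loss on a held-out batch at each checkpoint, regardless of which mining strategy is currently used for training (NumPy again; the function name is mine):

```python
import numpy as np

def batch_hard_loss(embeddings, labels, alpha=0.2):
    """Batch-hard triplet loss over one batch: each anchor uses its
    hardest positive (largest d(a,p)) and hardest negative (smallest
    d(a,n)). Tracking this one number throughout training gives a
    consistent comparison across the different mining curricula."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)
    eye = np.eye(len(labels), dtype=bool)
    same = (labels[:, None] == labels[None, :]) & ~eye  # positives, not self
    other = labels[:, None] != labels[None, :]          # negatives
    hardest_pos = np.max(np.where(same, dists, -np.inf), axis=1)
    hardest_neg = np.min(np.where(other, dists, np.inf), axis=1)
    return np.sum(np.maximum(0.0, hardest_pos - hardest_neg + alpha))
```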

*Caveat*: I don't know how or whether the use of BERT (or other Sesame Street models) in conjunction with triplet losses will change this analysis. I haven't used these models as extensively. However, because triplet loss is so tricky to use, my recommendation is starting there.
