I have read the BERT paper (https://arxiv.org/abs/1810.04805) and am trying to understand its Keras implementation (keras_bert). Here is my code to load the BERT model:
import keras
from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs

# Build an (untrained) BERT model with the BERT-Base configuration
model = get_model(
    token_num=30000,        # vocabulary size
    head_num=12,            # attention heads per transformer block
    transformer_num=12,     # number of transformer blocks
    embed_dim=768,          # hidden size
    feed_forward_dim=3072,  # feed-forward inner size
    seq_len=500,
    pos_num=512,
    dropout_rate=0.05,
)
compile_model(model)
model.summary()
Note that the parameters I used are the defaults of the BERT-Base configuration. In the Keras model summary, I can see that there are 2,362,368 trainable parameters in each of the multi-head self-attention layers, but I don't understand how to arrive at this number.
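The per-layer counts can also be read off programmatically rather than from the summary (a minimal sketch; it assumes keras_bert names its attention layers with a "MultiHeadSelfAttention" suffix, e.g. "Encoder-1-MultiHeadSelfAttention"):

for layer in model.layers:
    # Naming convention assumed from keras_bert; adjust the filter if the names differ
    if layer.name.endswith("MultiHeadSelfAttention"):
        print(layer.name, layer.count_params())  # prints 2362368 for each of the 12 blocks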
There are 12 attention heads in total, and the Q, K, V projection matrices each have dimension 768 × 768 (768 × 64 per head, concatenated across the 12 heads).
So the parameter count should have been 768 × 768 × 3 + 768 × 3 = 1,771,776, including biases.
But the actual number, 2,362,368, seems to equal 768 × 768 × 4 + 768 × 4.
Can someone explain how I can account for this number?
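As a sanity check, the two counts can be reproduced with plain arithmetic (a minimal sketch; only the 768 hidden size from above is used):

d = 768                      # hidden size (embed_dim)
expected = 3 * (d * d + d)   # Q, K, V weight matrices plus their biases
actual = 4 * (d * d + d)     # what the Keras summary reports per attention layer
print(expected)              # 1771776
print(actual)                # 2362368
print(actual - expected)     # 590592 = 768 * 768 + 768, exactly one more projection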
Best Answer
After the multi-head attention you have 12 per-head context vectors of dimension 64 each. They are concatenated into a 768-dimensional vector, which then has to be projected back to the model dimension; this output projection gives you another 768 × 768 + 768 parameters. Together with the Q, K, V projections, that is 4 × (768 × 768 + 768) = 2,362,368. In addition, there is a layer normalization with 2 × 768 parameters; those are not part of the 2,362,368 but appear in a separate layer.
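For a cross-check outside of keras_bert, a stock tf.keras.layers.MultiHeadAttention with 12 heads of size 64 on a 768-dimensional input uses the same Q/K/V-plus-output-projection parameterization and lands on the same number (a minimal sketch; the layer normalization is a separate layer and is not included in the 2,362,368):

import tensorflow as tf

# 12 heads of size 64 on a 768-dimensional input, as in BERT-Base
mha = tf.keras.layers.MultiHeadAttention(num_heads=12, key_dim=64)
x = tf.random.normal((1, 500, 768))
_ = mha(x, x)               # call the layer once so its weights get built
print(mha.count_params())   # 2362368 = 4 * (768 * 768 + 768)

# The subsequent layer normalization contributes its 2 * 768 parameters separately
ln = tf.keras.layers.LayerNormalization()
_ = ln(x)
print(ln.count_params())    # 1536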