Solved – How to account for the number of parameters in the multi-head self-attention layer of BERT

I have read the BERT paper for NLP and am trying to understand the Keras implementation. Here's my code to load the BERT model:

    import keras
    from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs

    model = get_model(
        token_num=30000,
        head_num=12,
        transformer_num=12,
        embed_dim=768,
        feed_forward_dim=3072,
        seq_len=500,
        pos_num=512,
        dropout_rate=0.05
    )
    compile_model(model)
    model.summary()

Note that all the parameters I used are the defaults of the BERT Base implementation. In the Keras model summary, I can see that there are 2,362,368 trainable parameters in each multi-head self-attention layer, but I don't understand how to arrive at this number.

There are 12 attention heads in total and, as I understand it, each head has Q, K, and V matrices, each of dimension 768 × 768.

So the parameter count should have been (768 × 768 × 3 + 768 × 3) × 12 = 1,771,776 × 12, including biases.

But the actual number 2,362,368 seems to be equal to (768 × 768 × 4 + 768 × 4).
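
The arithmetic in the two expressions above can be checked with a few lines of Python. This is only a sanity check of the numbers quoted in the question; the variable names are made up for illustration.

```python
# Sanity check of the two expressions from the question (d = 768, 12 heads).
d = 768
heads = 12

# Q, K, V weights plus biases, as the question counts them for one head
per_head_guess = d * d * 3 + d * 3
# The same count repeated for every head
all_heads_guess = per_head_guess * heads
# Four weight matrices plus four biases, the form that matches the summary
four_projections = d * d * 4 + d * 4

print(per_head_guess)    # 1771776
print(all_heads_guess)   # 21261312
print(four_projections)  # 2362368
```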

Can someone explain how I can account for this number?

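The Q, K, and V projections are not 768 × 768 per head. Each head works in a 768 / 12 = 64-dimensional subspace, so the projections for all 12 heads together amount to three 768 × 768 matrices plus three 768-element biases. After the multi-head attention, you have 12 context vectors of dimension 64, which are concatenated back into a 768-dimensional vector, and you need to project that back to the model dimension; this gives you another 768 × 768 + 768 parameters. The total is 4 × (768 × 768 + 768) = 2,362,368. In addition, there is layer normalization with 2 × 768 parameters, which is not part of this number.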
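
One way to reconcile the count with the 2,362,368 reported by the summary is sketched below. It assumes the usual BERT layout, in which each head has dimension 768 / 12 = 64 and the concatenated heads pass through one output projection. The helper `mha_param_count` is hypothetical and not part of keras_bert.

```python
def mha_param_count(d_model: int, head_num: int) -> int:
    """Hypothetical helper: weights and biases of one multi-head attention layer."""
    assert d_model % head_num == 0, "d_model must be divisible by head_num"
    d_head = d_model // head_num
    # Per head: Q, K, V each map d_model -> d_head, with a bias of size d_head
    qkv_per_head = 3 * (d_model * d_head + d_head)
    # Output projection: concatenated heads (head_num * d_head = d_model) -> d_model
    output_proj = d_model * d_model + d_model
    return head_num * qkv_per_head + output_proj


total = mha_param_count(768, 12)
layer_norm = 2 * 768  # gain and bias of layer normalization, counted separately

print(total)       # 2362368
print(layer_norm)  # 1536
```

Under this assumption the result does not depend on the number of heads: `mha_param_count(768, 1)` gives the same total, because splitting into heads only partitions the same 768 × 768 matrices.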
