I have read the BERT paper for NLP (https://arxiv.org/abs/1810.04805), and I am trying to understand the Keras implementation. Here's my code to load the BERT model:

```python
import keras
from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs

model = get_model(
    token_num=30000,
    head_num=12,
    transformer_num=12,
    embed_dim=768,
    feed_forward_dim=3072,
    seq_len=500,
    pos_num=512,
    dropout_rate=0.05,
)
compile_model(model)
model.summary()
```

Note that all the parameters I used are the defaults of the BERT Base implementation. In the Keras model summary, I can see that there are 2,362,368 trainable parameters in each multi-head self-attention layer, but I don't understand how to arrive at this number.

There are 12 attention heads in total, and in each head the *Q*, *K*, *V* projection matrices each have dimension 768 × 64 (so that the 12 heads together cover the model dimension of 768).

So the count should have been (768 × 64 × 3 + 64 × 3) × 12 including biases = only 1,771,776.

But the actual number, 2,362,368, seems to be equal to 768 × 768 × 4 + 768 × 4.

Can someone explain how I can account for this number?
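As a sanity check, the per-head arithmetic above can be written out in plain Python (this is just arithmetic, not part of keras-bert; the variable names are my own):

```python
d_model = 768                 # embed_dim in BERT Base
n_heads = 12                  # head_num
d_head = d_model // n_heads   # 64 dimensions per head

# Q, K, V projections: each head has a 768 x 64 weight matrix
# plus a 64-dim bias, for each of the three projections
qkv_params = 3 * n_heads * (d_model * d_head + d_head)
print(qkv_params)  # 1771776 -- the count computed in the question
```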


#### Best Answer

After doing the multi-head attention, you have 12 heads' context vectors of dimension 64 each, which are concatenated into a single 768-dimensional vector. You then need to project it back to the model dimension, and this output projection gives you another 768 × 768 + 768 parameters, bringing the total to 1,771,776 + 590,592 = 2,362,368. In addition, there is a layer normalization with 2 × 768 parameters, but those are counted separately in the summary.
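The full breakdown can be sketched numerically (plain Python; the variable names are mine, but the arithmetic matches the BERT Base dimensions in the question):

```python
d_model = 768  # embed_dim in BERT Base

# Q, K, V projections, each a 768 x 768 weight matrix plus a 768-dim bias
# (equivalent to 12 heads of 768 x 64 each)
qkv_params = 3 * (d_model * d_model + d_model)

# Output projection back to the model dimension, also 768 x 768 plus bias
out_params = d_model * d_model + d_model

print(qkv_params + out_params)  # 2362368 -- matches the model summary

# Layer normalization (gamma and beta) is counted separately
ln_params = 2 * d_model
print(ln_params)  # 1536
```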