Solved – Named Entity Recognition using Keras and TensorFlow – Overfitting

I am trying to write a Named Entity Recognition model using Keras and TensorFlow.

I am training on a dataset whose entity classes are (Person, Products, Location, Others).
Among these entities, the dataset is imbalanced, with the "Others" entity being the majority class.

As a result, when I run predictions with my neural network, the output is heavily biased toward the "Others" entity. I want to know how to keep the model from overfitting to this majority class. Any approaches?
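To quantify the imbalance, a quick check is to count the tag frequencies in the training file. A minimal sketch, assuming the same "word tag" one-pair-per-line format my code below reads:

from collections import Counter

# Count how often each entity tag occurs in the tagged corpus
tag_counts = Counter()
with open('taggedcontent.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(' ')
        if len(parts) == 2:      # skip blank sentence-separator lines
            tag_counts[parts[1]] += 1
print(tag_counts)                # 'Others' dominates the other classes

The full training code I am sharing for reference: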


import numpy as np
from keras.models import Sequential
from keras.layers.recurrent import LSTM
from keras.layers.core import Activation, Dense
from keras.layers.wrappers import TimeDistributed
from keras.preprocessing.sequence import pad_sequences
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn.metrics import confusion_matrix, accuracy_score, precision_recall_fscore_support

# Read the tagged corpus: one "word tag" pair per line, blank line between sentences
raw = open('taggedcontent.txt', 'r').readlines()

all_x = []
point = []
for line in raw:
    stripped_line = line.strip().split(' ')
    point.append(stripped_line)
    if line == '\n':                     # blank line marks the end of a sentence
        all_x.append(point[:-1])
        point = []
all_x = all_x[:-1]

lengths = [len(x) for x in all_x]
print('Input sequence length range:', max(lengths), min(lengths))

short_x = [x for x in all_x if len(x) < 3258]

X = [[c[0] for c in x] for x in short_x]   # words
y = [[c[1] for c in x] for x in short_x]   # tags

all_text = [c for x in X for c in x]

words = list(set(all_text))
word2ind = {word: index for index, word in enumerate(words)}
ind2word = {index: word for index, word in enumerate(words)}
labels = list(set([c for x in y for c in x]))
label2ind = {label: (index + 1) for index, label in enumerate(labels)}
ind2label = {(index + 1): label for index, label in enumerate(labels)}
print('Vocabulary size:', len(word2ind), len(label2ind))

maxlen = max([len(x) for x in X])
print('Maximum sequence length:', maxlen)

def encode(x, n):
    result = np.zeros(n)
    result[x] = 1
    return result

X_enc = [[word2ind[c] for c in x] for x in X]

max_label = max(label2ind.values()) + 1
y_enc = [[0] * (maxlen - len(ey)) + [label2ind[c] for c in ey] for ey in y]
y_enc = [[encode(c, max_label) for c in ey] for ey in y_enc]

X_enc = pad_sequences(X_enc, maxlen=maxlen)
y_enc = pad_sequences(y_enc, maxlen=maxlen)

X_train, X_test, y_train, y_test = train_test_split(X_enc, y_enc,
    test_size=11*32, train_size=45*32, random_state=42)
print('Training and testing tensor shapes:', X_train.shape, X_test.shape,
    y_train.shape, y_test.shape)

max_features = len(word2ind)
embedding_size = 128
hidden_size = 32
out_size = len(label2ind) + 1

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen, mask_zero=True))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(out_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

batch_size = 32
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=10,
    validation_data=(X_test, y_test))

# Serialize the architecture to JSON and the weights to HDF5
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights("model.h5")
print("Saved model to disk")
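For completeness, the saved architecture and weights can be reloaded later with model_from_json and load_weights; a minimal sketch:

from keras.models import model_from_json

# Rebuild the architecture from the serialized JSON...
with open("model.json", "r") as json_file:
    loaded_model = model_from_json(json_file.read())
# ...and restore the trained weights from HDF5
loaded_model.load_weights("model.h5")
loaded_model.compile(loss='categorical_crossentropy', optimizer='adam')

This is how I then run predictions on new text: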


 x = "Google Microsoft Trump Pepsi"  X = x..strip().split(' ')  words = list(set(all_text))  word2ind = {word: index for index, word in enumerate(words)}  X_enc = [[word2ind[c] for c in x] for x in X]  X_enc = pad_sequences(X_enc, maxlen=3258)  pred = model.predict_classes(X_enc) 

The predictions are completely biased toward the "Others" entity. Should I be creating a new word2ind in the prediction phase, or should I be reusing the one from the training phase? How can I change this code to avoid overfitting? Any suggestions are welcome.
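If reusing the training-time mapping is the right approach, I assume it would look something like this sketch, persisting word2ind with pickle (the file name is illustrative):

import pickle

# At training time: save the exact word -> index mapping used to encode X
with open('word2ind.pkl', 'wb') as f:
    pickle.dump(word2ind, f)

# At prediction time: reload it instead of rebuilding from set(), whose
# iteration order is not guaranteed to be the same across runs
with open('word2ind.pkl', 'rb') as f:
    word2ind = pickle.load(f)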

I also want to add reinforcement or online learning at the end of this model, so that misclassified instances are fed back into the model to make it more accurate. How can I use reinforcement learning in NLP use cases? Help is appreciated.
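For the online-learning part, what I have in mind is incrementally updating the trained model on the misclassified examples with train_on_batch. A rough sketch (for illustration it draws the examples from the test split, which would leak data in practice; padded positions are also included in the mismatch check for simplicity):

import numpy as np

# Find test sequences where the predicted tag sequence differs from the truth
pred = model.predict_classes(X_test)
true = y_test.argmax(axis=-1)
miss = np.any(pred != true, axis=-1)     # boolean mask over sequences

# Incrementally update the trained model on just those examples
if miss.any():
    loss = model.train_on_batch(X_test[miss], y_test[miss])
    print('Online-update loss on misclassified batch:', loss)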

Best Answer

Try adding a CRF layer to the end of your model; you can find an implementation for Keras in the keras-contrib package.
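A rough sketch of what that change could look like with the CRF layer from keras-contrib, reusing max_features, embedding_size, hidden_size, maxlen, and out_size from the question's code (assumes keras-contrib is installed):

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras_contrib.layers import CRF

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen, mask_zero=True))
model.add(LSTM(hidden_size, return_sequences=True))

# The CRF replaces TimeDistributed(Dense) + softmax and learns transition
# scores between adjacent tags, which usually helps sequence labelling
crf = CRF(out_size)
model.add(crf)
model.compile(loss=crf.loss_function, optimizer='adam', metrics=[crf.accuracy])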
