I am trying to write a Named Entity Recognition model using Keras and TensorFlow.
I am training on data whose tokens are tagged with four entity types: (Person, Products, Location, Others).
Among these entities the dataset is imbalanced, with the "Others" entity being the majority class.
As a result, when I run predictions with my neural network, the results are biased toward the "Others" entity. I want to know how to keep the model from overfitting to this majority class. Any approaches? I am sharing code for your reference.
TRAINING PHASE
import numpy as np
from keras.models import Sequential
from keras.layers.recurrent import LSTM
from keras.layers.core import Activation, Dense
from keras.layers.wrappers import TimeDistributed
from keras.preprocessing.sequence import pad_sequences
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn
from sklearn.metrics import confusion_matrix, accuracy_score, precision_recall_fscore_support

# Read the tagged corpus: one "token tag" pair per line, sentences separated by blank lines.
raw = open('taggedcontent.txt', 'r').readlines()

all_x = []
point = []
for line in raw:
    stripped_line = line.strip().split(' ')
    point.append(stripped_line)
    if line == '\n':                      # blank line marks the end of a sentence
        all_x.append(point[:-1])
        point = []
all_x = all_x[:-1]

lengths = [len(x) for x in all_x]
print('Input sequence length range: ', max(lengths), min(lengths))

short_x = [x for x in all_x if len(x) < 3258]

X = [[c[0] for c in x] for x in short_x]  # token sequences
y = [[c[1] for c in x] for x in short_x]  # label sequences

# Build word and label vocabularies; label indices start at 1 so 0 can be used for padding.
all_text = [c for x in X for c in x]
words = list(set(all_text))
word2ind = {word: index for index, word in enumerate(words)}
ind2word = {index: word for index, word in enumerate(words)}
labels = list(set([c for x in y for c in x]))
label2ind = {label: (index + 1) for index, label in enumerate(labels)}
ind2label = {(index + 1): label for index, label in enumerate(labels)}
print('Vocabulary size:', len(word2ind), len(label2ind))

maxlen = max([len(x) for x in X])
print('Maximum sequence length:', maxlen)

def encode(x, n):
    """One-hot encode a label index into a vector of length n."""
    result = np.zeros(n)
    result[x] = 1
    return result

X_enc = [[word2ind[c] for c in x] for x in X]
max_label = max(label2ind.values()) + 1
y_enc = [[0] * (maxlen - len(ey)) + [label2ind[c] for c in ey] for ey in y]
y_enc = [[encode(c, max_label) for c in ey] for ey in y_enc]

X_enc = pad_sequences(X_enc, maxlen=maxlen)
y_enc = pad_sequences(y_enc, maxlen=maxlen)

X_train, X_test, y_train, y_test = train_test_split(
    X_enc, y_enc, test_size=11 * 32, train_size=45 * 32, random_state=42)
print('Training and testing tensor shapes:', X_train.shape, X_test.shape, y_train.shape, y_test.shape)

max_features = len(word2ind)
embedding_size = 128
hidden_size = 32
out_size = len(label2ind) + 1

# Embedding -> LSTM -> per-timestep softmax over the label set.
model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen, mask_zero=True))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(out_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

batch_size = 32
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=10,  # epochs= in Keras 2+
          validation_data=(X_test, y_test))

# serialize model to JSON and weights to HDF5
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights("model.h5")
print("Saved model to disk")
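For reference, one approach that is sometimes used for this kind of imbalance is to down-weight the dominant "Others" label in the per-token loss, so the network is not rewarded for predicting it everywhere. Below is a minimal sketch using Keras's temporal sample-weight mode; the 0.2 weight and the 'Others' tag string are illustrative assumptions, not values from the original code:

# Sketch: per-token loss weighting to counter the "Others" imbalance.
# Assumes label2ind, X_train, y_train and model from the training code above.
model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')   # enables per-timestep sample weights

others_ind = label2ind.get('Others', 0)        # adjust to the actual tag string in the data
label_ids = y_train.argmax(axis=-1)            # (samples, timesteps) integer labels
weights = np.ones(label_ids.shape)
weights[label_ids == others_ind] = 0.2         # illustrative down-weighting, tune as needed
weights[label_ids == 0] = 0.0                  # padding positions contribute nothing

model.fit(X_train, y_train, batch_size=32, nb_epoch=10, sample_weight=weights)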
VALIDATION PHASE
x = "Google Microsoft Trump Pepsi" X = x..strip().split(' ') words = list(set(all_text)) word2ind = {word: index for index, word in enumerate(words)} X_enc = [[word2ind[c] for c in x] for x in X] X_enc = pad_sequences(X_enc, maxlen=3258) pred = model.predict_classes(X_enc)
The predictions are completely biased toward the "Others" entity. Should I be creating a new word2ind in the prediction phase, or should I be reusing the one from the training phase? How can I change this code to avoid overfitting? Any suggestions are welcome.
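On the word2ind question: the usual pattern is to build the vocabulary once during training, persist it next to the model, and reload the same mapping at prediction time rather than recreating it. A minimal sketch of that pattern, using pickle purely as an illustrative choice of serialization:

import pickle

# At the end of training: save the mappings alongside model.json / model.h5.
with open('word2ind.pkl', 'wb') as f:
    pickle.dump(word2ind, f)
with open('ind2label.pkl', 'wb') as f:
    pickle.dump(ind2label, f)

# At prediction time: reload the training-time mapping instead of rebuilding it.
with open('word2ind.pkl', 'rb') as f:
    word2ind = pickle.load(f)
# Map words unseen in training to 0 (the padding/mask index) rather than raising a KeyError.
X_enc = [[word2ind.get(w, 0) for w in X]]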
I also want to add reinforcement or online learning at the end of this pipeline, so that misclassified instances are fed back into the model to make it more accurate. How can I use reinforcement learning in NLP use cases? Help is appreciated.
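Side note on that last point: feeding re-labelled, previously misclassified instances back into a trained network is usually framed as incremental or online learning rather than reinforcement learning. In Keras the natural hook for that is train_on_batch; a minimal sketch, where X_new and y_new are hypothetical arrays encoded exactly like X_train and y_train:

# Sketch: one incremental gradient step on newly labelled (previously misclassified) data.
# X_new and y_new are hypothetical arrays encoded exactly like X_train / y_train.
loss = model.train_on_batch(X_new, y_new)
print('Update loss:', loss)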
Best Answer
Try adding a CRF layer to the end of your model. You can find an implementation for Keras here.
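For reference, a minimal sketch of what that could look like, assuming the keras-contrib package (which provides a CRF layer for Keras) and the variable names from the training code above:

# Sketch: replace the per-token softmax with a CRF output layer (keras-contrib).
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense
from keras_contrib.layers import CRF

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen, mask_zero=True))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(out_size)))
crf = CRF(out_size)                 # CRF learns transition scores between adjacent tags
model.add(crf)
model.compile(optimizer='adam', loss=crf.loss_function, metrics=[crf.accuracy])

Because the CRF scores whole tag sequences rather than each token independently, it tends to discourage implausible label runs, which often helps alongside measures against the class imbalance.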
Similar Posts:
- Solved – Understanding TFLearn IMDB sentiment analysis
- Solved – Once trained, is it normal that LSTM-based neural nets output different values even though the input is the same
- Solved – Evaluation metric for named entity recognition
- Solved – Word2Vec and PyTorch – am I approaching this correctly
- Solved – Does it make sense to do Cross Validation with a Small Sample