Text classification CNN nan loss - tensorflow

I'm attempting to train a neural network for text classification (sarcasm detection on reddit comments). I have the comment itself, its parent comment, and the subreddit in which these comments were made.
I engineer features such as the positivity, negativity, and neutrality of the comment, the same for the parent comment, and engineer 2 more numerical features based on the subreddit. This is a total of 8 engineered features.
The network is structured as follows: I run one set of 1D convolutional filters over the (word-embedding encoded) comment and another set over the (word-embedding encoded) parent comment. I then apply pooling and feed the outputs to a fully connected layer.
This works fine.
However, I want to add those 8 engineered features to the inputs of the fully connected layer. When I do so, the loss decreases (rather quickly) at first, then suddenly turns to NaN and the accuracy drops.
Below is the code from the function that builds the model:
embedding_layer = Embedding(num_words,
                            embedding_dim,
                            weights=[embeddings],
                            input_length=max_sequence_length,
                            trainable=False)
parent_embedding_layer = Embedding(parent_num_words,
                                   embedding_dim,
                                   weights=[parent_embeddings],
                                   input_length=max_sequence_length,
                                   trainable=False)
sequence_input = Input(shape=(2, 205))
# some tensor manipulation (elided) produces four tensors:
#   comment_sequence_input  - length-200 tensor of word indices for the comment
#   parent_sequence_input   - length-200 tensor of word indices for the parent comment
#   non_text_comment_input  - length-5 tensor of engineered features for the comment
#   non_text_parent_input   - length-5 tensor of engineered features for the parent comment
embedded_sequences = embedding_layer(comment_sequence_input)
parent_embedded_sequences = parent_embedding_layer(parent_sequence_input)
convs = []
# Convolutions
for filter_size in filter_sizes:
    l_conv = Conv1D(filters=filters, kernel_size=filter_size,
                    activation='relu')(embedded_sequences)
    l_pool = GlobalMaxPooling1D()(l_conv)
    convs.append(l_pool)
for filter_size in parent_filter_sizes:
    parent_l_conv = Conv1D(filters=parent_filters, kernel_size=filter_size,
                           activation='relu')(parent_embedded_sequences)
    parent_l_pool = GlobalMaxPooling1D()(parent_l_conv)
    convs.append(parent_l_pool)
# End of convolutions

# Inclusion of engineered features
convs.append(non_text_comment_input)
convs.append(non_text_parent_input)
l_merge = concatenate(convs, axis=1)
# End of section

x = Dropout(0.30)(l_merge)

# Start of fully connected layer
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
preds = Dense(labels_index, activation='sigmoid')(x)
adam_optimizer = Adam(learning_rate=0.000001)
model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',
optimizer=adam_optimizer,
metrics=['acc'])
model.summary()
return model
The input is a tensor of shape (batch_dim, 2, 205): 200 positions for the word indices of each text, plus 5 for the 3 sentiment engineered features and the 2 subreddit engineered features. The dimension of size 2 separates the comment from the parent comment.
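For reference, a minimal sketch of how such an input could be split into the four tensors used above; the Lambda slicing is my assumption, since the original "tensor manipulation" is not shown:
from tensorflow.keras.layers import Input, Lambda

sequence_input = Input(shape=(2, 205))
# row 0 holds the comment, row 1 the parent comment; the first 200 columns
# are word indices and the last 5 are the engineered features
comment_sequence_input = Lambda(lambda t: t[:, 0, :200])(sequence_input)
parent_sequence_input = Lambda(lambda t: t[:, 1, :200])(sequence_input)
non_text_comment_input = Lambda(lambda t: t[:, 0, 200:])(sequence_input)
non_text_parent_input = Lambda(lambda t: t[:, 1, 200:])(sequence_input)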
Things I have tried:
Lowering the learning rate
Adding dropout
Normalising inputs
Normalising inputs and scaling them down by a factor of 100
Different optimizers (RMSprop, Nadam, etc)
Adding regularization (bias reg, kernel reg, activity reg) of various levels to my layers.
I tried testing this by setting all my engineered feature values to 0, and the network trains perfectly fine. Not sure what I should do here.

Related

How to avoid overfitting in CNN?

I'm building a model that predicts a person's age from an image of their face. I'm using this pretrained model, and I made a custom loss function and custom metrics. I get decent results, but I want to improve them. In particular, I noticed that after some epochs the model begins to overfit the training set and the val_loss increases. How can I avoid this? I'm already using Dropout, but it doesn't seem to be enough.
I think maybe I should use L1 and L2 regularization, but I don't know how.
def resnet_model():
    model = VGGFace(model='resnet50')  # model: {resnet50, vgg16, senet50}
    xl = model.get_layer('avg_pool').output
    x = keras.layers.Flatten(name='flatten')(xl)
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dropout(0.5)(x)
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dropout(0.5)(x)
    x = keras.layers.Dense(11, activation='softmax', name='predictions')(x)
    model = keras.engine.Model(model.input, outputs=x)
    return model
model = resnet_model()

initial_learning_rate = 0.0003
epochs = 20
batch_size = 110
num_steps = train_x.shape[0] // batch_size

learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [3 * num_steps, 10 * num_steps, 16 * num_steps, 25 * num_steps],
    [1e-4, 1e-5, 1e-6, 1e-7, 5e-7]
)

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)
model.compile(loss=custom_loss, optimizer=optimizer, metrics=['accuracy', one_off_accuracy])
model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, validation_data=(test_x, test_y))
This is an example of the result (plot not included here).
There are many regularization methods to help you avoid overfitting your model:
Dropouts:
Randomly disables neurons during the training, in order to force other neurons to be trained as well.
L1/L2 penalties:
Penalizes large weights so that no single parameter dominates the prediction; this encourages all parameters to be taken into consideration when classifying an input (see the sketch after this list).
Random Gaussian Noise at the inputs:
Adds random Gaussian noise to the inputs: x = x + r, where r is a small value drawn from a normal distribution. Because every input is slightly different in each epoch, the model cannot simply memorize the training set, which helps prevent overfitting.
Label Smoothing:
Instead of saying that a target is 0 or 1, You can smooth those values (e.g. 0.1 & 0.9).
Early Stopping:
This is a quite common technique for avoiding over-training. If you notice that the training loss keeps decreasing while the validation accuracy starts to drop, that is a good sign to stop training, as your model is beginning to overfit.
K-Fold Cross-Validation:
This is a very strong technique, which ensures that your model is not fed all the time with the same inputs and is not overfitting.
Data Augmentations:
By rotating/shifting/zooming/flipping/padding etc. an image you make sure that your model is forced to train better its parameters and not overfit to the existing dataset.
I am quite sure there are more techniques for avoiding overfitting. This repository contains many examples of how the above techniques are applied to a dataset:
https://github.com/kochlisGit/Tensorflow-State-of-the-Art-Neural-Networks
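As a minimal sketch of how L2 penalties, feature noise, and label smoothing could be added to the model from the question (the regularization factor, noise level, and smoothing value are assumptions):
from tensorflow import keras
from tensorflow.keras import regularizers

def resnet_model_regularized(base_model):
    # base_model is assumed to be the VGGFace backbone from the question
    xl = base_model.get_layer('avg_pool').output
    x = keras.layers.Flatten(name='flatten')(xl)
    x = keras.layers.GaussianNoise(0.1)(x)  # noise on the extracted features (assumed stddev)
    x = keras.layers.Dense(4096, activation='relu',
                           kernel_regularizer=regularizers.l2(1e-4))(x)  # L2 penalty (assumed factor)
    x = keras.layers.Dropout(0.5)(x)
    x = keras.layers.Dense(4096, activation='relu',
                           kernel_regularizer=regularizers.l2(1e-4))(x)
    x = keras.layers.Dropout(0.5)(x)
    out = keras.layers.Dense(11, activation='softmax', name='predictions')(x)
    return keras.Model(base_model.input, out)

# label smoothing can be applied through the loss, but only if the custom loss
# is replaced by, or wraps, a cross-entropy
smoothed_loss = keras.losses.CategoricalCrossentropy(label_smoothing=0.1)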
You can try incorporating image augmentation into your training, which increases the "sample size" of your data as well as its "diversity", as @Suraj S Jain mentioned. The official tutorial is here: https://www.tensorflow.org/tutorials/images/data_augmentation
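A minimal sketch of what that could look like with Keras' ImageDataGenerator (the augmentation parameters are assumptions; train_x/train_y/test_x/test_y are the arrays from the question, and a tf.keras 2.x setup is assumed):
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# randomly rotate, shift, zoom and flip the training images on the fly
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
)

model.fit(
    datagen.flow(train_x, train_y, batch_size=batch_size),
    epochs=epochs,
    validation_data=(test_x, test_y),
)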

How to get hidden state matrices out of stacked BiLSTM layer in Tensorflow Keras?

I am trying to write code for this architecture (Question Answering model, paper: https://www.hindawi.com/journals/cin/2019/9543490/) and am looking for help with how to get the hidden state matrices Hq and Ha out of the stacked BiLSTM layers. Could someone please advise?
# Creating Embedding Layer for Query
# Considered fixed length as 40 for both question and answer as per the research paper
embedding_layer1 = layers.Embedding(vocab_size_query, 300, weights=[embedding_matrix_query],
                                    input_length=40, trainable=False)
input_text1 = Input(shape=(40,), name="input_text1")
x = embedding_layer1(input_text1)

# Creating Bidirectional layers for Query
# Each word in the context and question should be made aware of the nearby words.
# We use a bidirectional recurrent neural network (LSTMs) here.
x = Bidirectional(LSTM(128, recurrent_dropout=0.5, kernel_regularizer=regularizers.l2(0.001), return_sequences=True))(x)
x = Bidirectional(LSTM(128, recurrent_dropout=0.5, kernel_regularizer=regularizers.l2(0.001), return_sequences=True))(x)
flatten_1 = Flatten()(x)

# Creating Embedding Layer for Passage
embedding_layer2 = layers.Embedding(vocab_size_answer, 300, weights=[embedding_matrix_answer],
                                    input_length=40, trainable=False)
input_text2 = Input(shape=(40,), name="input_text2")  # layer names must be unique
x2 = embedding_layer2(input_text2)

# Creating Bidirectional layers for Passage
x2 = Bidirectional(LSTM(128, recurrent_dropout=0.5, kernel_regularizer=regularizers.l2(0.001), return_sequences=True))(x2)
x2 = Bidirectional(LSTM(128, recurrent_dropout=0.5, kernel_regularizer=regularizers.l2(0.001), return_sequences=True))(x2)
flatten_2 = Flatten()(x2)
According to the model structure and your source code, you can obtain Hq and Ha by extracting the outputs of the flatten_1 and flatten_2 layers. To extract the output of an intermediate layer, you can create a new model whose input is the original input and whose output is the output of the appropriate layer.
from tensorflow.keras.models import Model

model = ...  # create the original model
layer_name = 'my_layer'
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict(data)
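Applied to the code in the question, a minimal sketch (assuming the inputs are named input_text1 and input_text2 as above; preds, query_data, and passage_data are placeholders for your output layer and input arrays) that exposes both flattened hidden-state matrices at once:
from tensorflow.keras.models import Model

# build the full model first; preds is whatever output layer the QA model ends with
qa_model = Model(inputs=[input_text1, input_text2], outputs=preds)

# Hq and Ha correspond to the outputs of flatten_1 and flatten_2
hidden_state_model = Model(inputs=qa_model.inputs, outputs=[flatten_1, flatten_2])
Hq, Ha = hidden_state_model.predict([query_data, passage_data])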

Recurrent Neural Network Mini-Batch dependency after trained

Currently, I have a neural network, built in tensorflow that is used to classify time sequence data into one of 6 categories. The network is composed of:
2 fully connected layers -> LSTM unit -> softmax -> output
All layers have regularization in the form of dropout and or layer normalization. In order to speed up the training process, I am using mini-batching of the data, where the mini-batch size = # of categories = 6. Each mini-batch contains exactly one sample for each of the 6 categories, arranged randomly in the mini-batch. Below is the feed-forward code, where x is of shape [batch_size, number of time steps, number of features], and the various get commands are simple definitions for creating standard fully connected layers and LSTM units with regularization.
def getFullyConnected(input, hidden, dropout, layer, phase):
    weight = tf.Variable(tf.random_normal([input.shape.dims[1].value, hidden]), name="weight_layer" + str(layer))
    bias = tf.Variable(tf.random_normal([1]), name="bias_layer" + str(layer))
    layer = tf.add(tf.matmul(input, weight), bias)
    layer = tf.contrib.layers.batch_norm(layer,
                                         center=True, scale=True,
                                         is_training=phase)
    layer = tf.minimum(tf.nn.relu(layer), FLAGS.relu_clip)
    layer = tf.nn.dropout(layer, (1.0 - dropout))
    return layer
def RNN(x, weights, biases, time_steps):
    # shape the input as [batch_size*time_steps, input_depth]
    x = tf.reshape(x, [-1, input_depth])

    layer1 = getFullyConnected(input=x, hidden=16, dropout=full_drop, layer=1, phase=True)
    layer2 = getFullyConnected(input=layer1, hidden=input_depth*3, dropout=full_drop, layer=2, phase=True)

    rnn_input = tf.reshape(layer2, [-1, time_steps, input_depth*3])

    # 1-layer LSTM with n_hidden units
    LSTM_cell = getLSTMcell(n_hidden)

    # generate prediction
    outputs, state = tf.nn.dynamic_rnn(LSTM_cell,
                                       rnn_input,
                                       dtype=tf.float32,
                                       time_major=False)

    # good old tensorboard saves
    tf.summary.histogram('weight', weights['out'])
    tf.summary.histogram('bias', biases['out'])

    # there are time_steps outputs, but only grab the last output for the classification
    return tf.sigmoid(tf.matmul(outputs[:, -1, :], weights['out']) + biases['out'])
Surprisingly, this network trained extremely well, giving me about 99.75% accuracy on my test data (which the trained network had never seen). However, it only scored this high when the data was fed into the network with the same mini-batch size as during training, 6. If I fed the data one sample at a time (mini-batch size = 1), the network scored only around 60%. What is weird is that if I instead train the network on single samples (mini-batch size = 1), the trained network works perfectly fine with high accuracy. This leads me to the odd conclusion that the network is almost learning to utilize the batch size in its learning, so much so that it becomes dependent on the mini-batch to classify correctly.
Is it a thing for a deep network to become dependent on the size of the mini-batch during training, so much that the final trained network will require input data to have the same mini-batch size just to perform correctly?
All ideas or thoughts would be loved!

LSTM with Condition

I'm studying LSTMs combined with CNNs in TensorFlow.
I want to feed a scalar label into the LSTM network as a condition.
Does anybody know which kind of LSTM does what I mean?
If one is available, please let me know how to use it.
Thank you.
This thread might interest you: Adding Features To Time Series Model LSTM.
You have basically 3 possible ways:
Let's take an example with weather data from two different cities: Paris and San Francisco. You want to predict the next temperature based on historical data. But at the same time, you expect the weather to change based on the city. You can either:
Combine the auxiliary features with the time series data, at the beginning or at the end (ugly!).
Concatenate the auxiliary features with the output of the RNN layer. It's a kind of post-RNN adjustment, since the RNN layer won't see this auxiliary info (a sketch of this option follows below).
Or just initialize the RNN states with a learned representation of the condition (e.g. Paris or San Francisco).
I wrote a library to condition on auxiliary inputs. It abstracts all the complexity and has been designed to be as user-friendly as possible:
https://github.com/philipperemy/cond_rnn/
The implementation is in tensorflow (>=1.13.1) and Keras.
Hope it helps!
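For example, here is a minimal Keras sketch of option 2, concatenating the scalar condition with the RNN output (the layer sizes and sequence length are assumptions, and this does not use the library above):
from tensorflow.keras import layers, Model

series_input = layers.Input(shape=(20, 1), name="series")      # historical temperatures (assumed length 20)
condition_input = layers.Input(shape=(1,), name="condition")   # scalar condition, e.g. a city id

h = layers.LSTM(64)(series_input)                 # RNN summary of the sequence
h = layers.concatenate([h, condition_input])      # post-RNN adjustment with the condition
out = layers.Dense(1)(h)                          # next-temperature prediction

model = Model([series_input, condition_input], out)
model.compile(optimizer="adam", loss="mse")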
Here's an example of applying a CNN and an LSTM over the per-image output probabilities of a sequence, as you asked:
def build_model(inputs):
    BATCH_SIZE = 4
    NUM_CLASSES = 2
    NUM_UNITS = 128
    H = 224
    W = 224
    C = 3
    TIME_STEPS = 4

    # inputs is assumed to be of shape (BATCH_SIZE, TIME_STEPS, H, W, C)
    # reshape your input such that you can apply the CNN to all images at once
    input_cnn_reshaped = tf.reshape(inputs, (-1, H, W, C))

    # define CNN, for instance VGG-16 (e.g. from tf.contrib.slim.nets)
    cnn_logits_output, _ = vgg_16(input_cnn_reshaped, num_classes=NUM_CLASSES)
    cnn_probabilities_output = tf.nn.softmax(cnn_logits_output)

    # reshape back to the time-series convention
    cnn_probabilities_output = tf.reshape(cnn_probabilities_output, (BATCH_SIZE, TIME_STEPS, NUM_CLASSES))

    # perform LSTM over the probabilities per image
    cell = tf.contrib.rnn.LSTMCell(NUM_UNITS)
    _, state = tf.nn.dynamic_rnn(cell, cnn_probabilities_output, dtype=tf.float32)

    # employ an FC layer over the last hidden state
    logits = tf.layers.dense(state.h, NUM_CLASSES)

    # logits is of shape (BATCH_SIZE, NUM_CLASSES)
    return logits
By the way, a better approach would be to run the LSTM over the CNN's last hidden layer, i.e., to use the CNN as a feature extractor and make the prediction over sequences of feature vectors.
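As a hedged sketch of that variant, assuming the slim VGG-16 exposes its penultimate activations through an end_points dict (the exact end-point key and sizes are assumptions, not confirmed by the answer above):
def build_feature_model(inputs):
    BATCH_SIZE = 4
    NUM_CLASSES = 2
    NUM_UNITS = 128
    H, W, C, TIME_STEPS = 224, 224, 3, 4

    # run the CNN once per frame and keep penultimate features instead of class probabilities
    input_cnn_reshaped = tf.reshape(inputs, (-1, H, W, C))
    _, end_points = vgg_16(input_cnn_reshaped, num_classes=NUM_CLASSES)
    features = tf.contrib.layers.flatten(end_points['vgg_16/fc7'])  # end-point key is an assumption

    # sequence of per-frame feature vectors
    features = tf.reshape(features, (BATCH_SIZE, TIME_STEPS, -1))

    cell = tf.contrib.rnn.LSTMCell(NUM_UNITS)
    _, state = tf.nn.dynamic_rnn(cell, features, dtype=tf.float32)
    return tf.layers.dense(state.h, NUM_CLASSES)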

Tensorflow lstm for sentiment analysis not learning. UPDATED

UPDATED:
I'm building a neural network for my final project and I need some help with it.
I'm trying to build an RNN to do sentiment analysis over Spanish text. I have about 200,000 labeled tweets, and I vectorized them using word2vec with a Spanish embedding.
Dataset & Vectorization:
I erased duplicates and split the dataset into training and testing sets.
Padding, unknown and end of sentence tokens are applied when vectorizing.
I mapped the #mentions to known names in the word2vec model. Example: #iamthebest => "John"
My model:
My data tensor has shape = (batch_size, 20, 300).
I have 3 classes: neutral, positive and negative, so my target tensor has shape = (batch_size, 3)
I use BasicLSTMCell cells and dynamic_rnn to build the net.
I use the Adam optimizer and softmax_cross_entropy for the loss calculation.
I use a dropout wrapper to decrease overfitting.
Last run:
I have tried different configurations and none of them seems to work.
Last setup: 2 layers, batch size 512, 15 epochs, and a learning rate of 0.001.
Weak points for me:
I'm worried about the final layer and the handling of the final state in dynamic_rnn.
Code:
# set variables
num_epochs = 15
tweet_size = 20
hidden_size = 200
vec_size = 300
batch_size = 512
number_of_layers= 1
number_of_classes= 3
learning_rate = 0.001
TRAIN_DIR="/checkpoints"
tf.reset_default_graph()
# Create a session
session = tf.Session()
# Inputs placeholders
tweets = tf.placeholder(tf.float32, [None, tweet_size, vec_size], "tweets")
labels = tf.placeholder(tf.float32, [None, number_of_classes], "labels")
# Placeholder for dropout
keep_prob = tf.placeholder(tf.float32)
# make the lstm cells, and wrap them in MultiRNNCell for multiple layers
def lstm_cell():
    cell = tf.contrib.rnn.BasicLSTMCell(hidden_size)
    return tf.contrib.rnn.DropoutWrapper(cell=cell, output_keep_prob=keep_prob)
multi_lstm_cells = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(number_of_layers)], state_is_tuple=True)
# Creates a recurrent neural network
outputs, final_state = tf.nn.dynamic_rnn(multi_lstm_cells, tweets, dtype=tf.float32)
with tf.name_scope("final_layer"):
# weight and bias to shape the final layer
W = tf.get_variable("weight_matrix", [hidden_size, number_of_classes], tf.float32, tf.random_normal_initializer(stddev=1.0 / math.sqrt(hidden_size)))
b = tf.get_variable("bias", [number_of_classes], initializer=tf.constant_initializer(1.0))
sentiments = tf.matmul(final_state[-1][-1], W) + b
prob = tf.nn.softmax(sentiments)
tf.summary.histogram('softmax', prob)
with tf.name_scope("loss"):
# define cross entropy loss function
losses = tf.nn.softmax_cross_entropy_with_logits(logits=sentiments, labels=labels)
loss = tf.reduce_mean(losses)
tf.summary.scalar("loss", loss)
with tf.name_scope("accuracy"):
# round our actual probabilities to compute error
accuracy = tf.to_float(tf.equal(tf.argmax(prob,1), tf.argmax(labels,1)))
accuracy = tf.reduce_mean(tf.cast(accuracy, dtype=tf.float32))
tf.summary.scalar("accuracy", accuracy)
# define our optimizer to minimize the loss
with tf.name_scope("train"):
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
#tensorboard summaries
merged_summary = tf.summary.merge_all()
logdir = "tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
writer = tf.summary.FileWriter(logdir, session.graph)
# initialize any variables
tf.global_variables_initializer().run(session=session)
# Create a saver for writing training checkpoints.
saver = tf.train.Saver()
# load our data and separate it into tweets and labels
train_tweets = np.load('data_es/train_vec_tweets.npy')
train_labels = np.load('data_es/train_vec_labels.npy')
test_tweets = np.load('data_es/test_vec_tweets.npy')
test_labels = np.load('data_es/test_vec_labels.npy')
**HERE I HAVE THE LOOP FOR TRAINING AND TESTING, I KNOW ITS FINE**
I have already solved my problem. After reading some papers and more trial and error, I figured out what my mistakes were.
1) Dataset: I had a large dataset, but I didn't format it properly.
I checked the distribution of tweet labels (Neutral, Positive and Negative), realized there was a disparity in the distribution of said tweets and normalized it.
I cleaned it up even more by erasing url hashtags and unnecessary punctuation.
I shuffled prior to vectorization.
2) Initialization:
I initialized the MultiRNNCell state with zeros and changed my custom final layer to tf.contrib.layers.fully_connected. I also added initialization of the bias and weight matrix. (After fixing this, I started to see better loss and accuracy plots in TensorBoard.)
3) Dropout:
I read this paper, Recurrent Dropout without Memory Loss, and I changed my dropouts accordingly; I started seeing improvements in the loss and accuracy.
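As a rough sketch of what "changing the dropouts" can look like with this TF1-era API (the keep probabilities are assumptions, and DropoutWrapper's variational option only approximates the scheme from that paper):
def lstm_cell():
    cell = tf.contrib.rnn.BasicLSTMCell(hidden_size)
    # drop the recurrent state as well as the output, reusing the same
    # dropout mask at every time step (variational recurrent dropout)
    return tf.contrib.rnn.DropoutWrapper(cell=cell,
                                         output_keep_prob=keep_prob,
                                         state_keep_prob=keep_prob,
                                         variational_recurrent=True,
                                         dtype=tf.float32)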
4) Decaying the learning rate:
I added exponential learning-rate decay with a decay step of 10,000 to control over-fitting.
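A minimal sketch of that kind of schedule in the same TF1 API (the decay rate of 0.96 is an assumption):
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(start_learning_rate, global_step,
                                           decay_steps=10000, decay_rate=0.96,
                                           staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)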
Final results:
After applying all of these changes, I achieved a test accuracy of 84%, which is acceptable because my data set still sucks.
My final network config was:
num_epochs = 20
tweet_size = 20
hidden_size = 400
vec_size = 300
batch_size = 512
number_of_layers= 2
number_of_classes= 3
start_learning_rate = 0.001