Tensorflow lstm for sentiment analysis not learning. UPDATED - tensorflow

UPDATED:
i'm building a Neural Network for my final project and i need some help with it.
I'm trying to build a rnn to do sentiment analysis over Spanish text. I have about 200,000 labeled tweets and i vectorized them using a word2vec with a Spanish embedding
Dataset & Vectorization:
I erased duplicates and split the dataset into training and testing sets.
Padding, unknown and end of sentence tokens are applied when vectorizing.
I mapped the #mentions to known names in the word2vec model. Example: #iamthebest => "John"
My model:
My data tensor has shape = (batch_size, 20, 300).
I have 3 classes: neutral, positive and negative, so my target tensor has shape = (batch_size, 3)
I use BasicLstm cells and dynamic rnn to build the net.
I use Adam Optimizer, and softmax_cross entropy for the loss calculation
I use a dropout wrapper to decrease the overfitting.
Last run:
I have tried with different configurations and non of them seem to work.
Last setup: 2 Layers, 512 batch size, 15 epochs and 0.001 of lr.
Weak points for me:
im worried about the final layer and the handing of the final state in the dynamic_rnn
Code:
# set variables
num_epochs = 15
tweet_size = 20
hidden_size = 200
vec_size = 300
batch_size = 512
number_of_layers= 1
number_of_classes= 3
learning_rate = 0.001
TRAIN_DIR="/checkpoints"
tf.reset_default_graph()
# Create a session
session = tf.Session()
# Inputs placeholders
tweets = tf.placeholder(tf.float32, [None, tweet_size, vec_size], "tweets")
labels = tf.placeholder(tf.float32, [None, number_of_classes], "labels")
# Placeholder for dropout
keep_prob = tf.placeholder(tf.float32)
# make the lstm cells, and wrap them in MultiRNNCell for multiple layers
def lstm_cell():
cell = tf.contrib.rnn.BasicLSTMCell(hidden_size)
return tf.contrib.rnn.DropoutWrapper(cell=cell, output_keep_prob=keep_prob)
multi_lstm_cells = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(number_of_layers)], state_is_tuple=True)
# Creates a recurrent neural network
outputs, final_state = tf.nn.dynamic_rnn(multi_lstm_cells, tweets, dtype=tf.float32)
with tf.name_scope("final_layer"):
# weight and bias to shape the final layer
W = tf.get_variable("weight_matrix", [hidden_size, number_of_classes], tf.float32, tf.random_normal_initializer(stddev=1.0 / math.sqrt(hidden_size)))
b = tf.get_variable("bias", [number_of_classes], initializer=tf.constant_initializer(1.0))
sentiments = tf.matmul(final_state[-1][-1], W) + b
prob = tf.nn.softmax(sentiments)
tf.summary.histogram('softmax', prob)
with tf.name_scope("loss"):
# define cross entropy loss function
losses = tf.nn.softmax_cross_entropy_with_logits(logits=sentiments, labels=labels)
loss = tf.reduce_mean(losses)
tf.summary.scalar("loss", loss)
with tf.name_scope("accuracy"):
# round our actual probabilities to compute error
accuracy = tf.to_float(tf.equal(tf.argmax(prob,1), tf.argmax(labels,1)))
accuracy = tf.reduce_mean(tf.cast(accuracy, dtype=tf.float32))
tf.summary.scalar("accuracy", accuracy)
# define our optimizer to minimize the loss
with tf.name_scope("train"):
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
#tensorboard summaries
merged_summary = tf.summary.merge_all()
logdir = "tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
writer = tf.summary.FileWriter(logdir, session.graph)
# initialize any variables
tf.global_variables_initializer().run(session=session)
# Create a saver for writing training checkpoints.
saver = tf.train.Saver()
# load our data and separate it into tweets and labels
train_tweets = np.load('data_es/train_vec_tweets.npy')
train_labels = np.load('data_es/train_vec_labels.npy')
test_tweets = np.load('data_es/test_vec_tweets.npy')
test_labels = np.load('data_es/test_vec_labels.npy')
**HERE I HAVE THE LOOP FOR TRAINING AND TESTING, I KNOW ITS FINE**

I have already solved my problem. After reading some papers and more trial and error, I figured out what my mistakes were.
1) Dataset: I had a large dataset, but I didn't format it properly.
I checked the distribution of tweet labels (Neutral, Positive and Negative), realized there was a disparity in the distribution of said tweets and normalized it.
I cleaned it up even more by erasing url hashtags and unnecessary punctuation.
I shuffled prior to vectorization.
2) Initialization:
I initialized the MultiRNNCell with zeros and I changed my custom final layer to tf.contrib.fully_connected. I also added the initialization of the bias and weight matrix. (By fixing this, I started to see better loss and accuracy plots in Tensorboard)
3) Dropout:
I read this paper, Recurrent Dropout without Memory Loss, and I changed my dropouts accordingly; I started seeing improvements in the loss and accuracy.
4) Decaying the learning rate:
I added an exponential decaying rate after 10,000 steps to control over-fitting.
Final results:
After applying all of these changes, I achieved a test accuracy of 84%, which is acceptable because my data set still sucks.
My final network config was:
num_epochs = 20
tweet_size = 20
hidden_size = 400
vec_size = 300
batch_size = 512
number_of_layers= 2
number_of_classes= 3
start_learning_rate = 0.001

Related

Text classification CNN nan loss

I'm attempting to train a neural network for text classification (sarcasm detection on reddit comments). I have the comment itself, its parent comment, and the subreddit in which these comments were made.
I engineer features such as the positivity, negativity, and neutrality of the comment, the same for the parent comment, and engineer 2 more numerical features based on the subreddit. This is a total of 8 engineered features.
The neural network I use is structured as such: I run a bunch of filters over my (word-embedding translated) comment, a bunch of filters over my parent comment (word-embedding translated), using a 1D CNN. I subsequently do some pooling and feed the outputs to a fully connected layer.
This works fine.
However, I wish to add in those 8 engineered features to my inputs to the fully connected layer. But when I do so, the loss decreases (rather quickly), and suddenly I see the loss turn to nan and accuracy decrease.
Below is the code within my function to produce the model: -
embedding_layer = Embedding(num_words,
embedding_dim,
weights=[embeddings],
input_length=max_sequence_length,
trainable=False)
parent_embedding_layer = Embedding(parent_num_words,
embedding_dim,
weights=[parent_embeddings],
input_length=max_sequence_length,
trainable=False)
sequence_input = Input(shape=(2, 205))
// some tensor manipulation
comment_sequence_input:= length 200 tensor of words in comment
parent_sequence_input:= length 200 tensor of words in comment
non_text_comment_input:= length 5 tensor with 5 engineered features
non_text_parent_input:= length 5 tensor with 5 engineered features
embedded_sequences = embedding_layer(comment_sequence_input)
parent_embedded_sequences = parent_embedding_layer(parent_sequence_input)
convs = []
// Convolutions
for filter_size in filter_sizes:
l_conv = Conv1D(filters=filters, kernel_size=filter_size,
activation='relu')(embedded_sequences)
l_pool = GlobalMaxPooling1D()(l_conv)
convs.append(l_pool)
for filter_size in parent_filter_sizes:
parent_l_conv = Conv1D(filters=parent_filters, kernel_size=filter_size,
activation='relu')(parent_embedded_sequences)
parent_l_pool = GlobalMaxPooling1D()(parent_l_conv)
convs.append(parent_l_pool)
// End of convolutions
//Inclusion of engineered features
convs.append(non_text_comment_input)
convs.append(non_text_parent_input)
l_merge = concatenate(convs, axis=1)
// end of section
x = Dropout(0.30)(l_merge)
// Start of fully connected layer
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
preds = Dense(labels_index, activation='sigmoid')(x)
adam_optimizer = Adam(learning_rate=0.000001)
model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',
optimizer=adam_optimizer,
metrics=['acc'])
model.summary()
return model
The input provided is a (batch_dim, 2, 205) size tensor. 200 for the number of words in each entry, and 5 for the 3 sentiment engineered features and 2 subreddit engineered features. The 2 dimensions are for comment and parent comment.
Things I have tried:
Lowering the learning rate
Adding dropout
Normalising inputs
Normalising inputs and scaling them down by a factor of 100
Different optimizers (RMSprop, Nadam, etc)
Adding regularization (bias reg, kernel reg, activity reg) of various levels to my layers.
I tried testing this by setting all my engineered feature values to 0, and the network trains perfectly fine. Not sure what I should do here.

tensorflow, compute gradients with respect to weights that come from two models (encoder, decoder)

I have a encoder model and a decoder model (RNN).
I want to compute the gradients and update the weights.
I'm somewhat confused by what I've seen so far on the web.
Which block is the best practice? Is there any difference between the two options? Gradients seems to converge faster in Block 1, I do not know why?
# BLOCK 1, in two operations
encoder_gradients,decoder_gradients = tape.gradient(loss,[encoder_model.trainable_variables,decoder_model.trainable_variables])
myoptimizer.apply_gradients(zip(encoder_gradients,encoder_model.trainable_variables))
myoptimizer.apply_gradients(zip(decoder_gradients,decoder_model.trainable_variables))
# BLOCK 2, in one operation
gradients = tape.gradient(loss,encoder_model.trainable_variables + decoder_model.trainable_variables)
myoptimizer.apply_gradients(zip(gradients,encoder_model.trainable_variables +
decoder_model.trainable_variables))
You can manually verify this.
First, let's simplify the model. Let the encoder and decoder both be a single dense layer. This is mostly for simplicity and you can print out the weights being applying the gradients, gradients and weights after applying the gradients.
import tensorflow as tf
import numpy as np
from copy import deepcopy
# create a simple model with one encoder and one decoder layer.
class custom_net(tf.keras.Model):
def __init__(self):
super().__init__()
self.encoder = tf.keras.layers.Dense(3, activation='relu')
self.decoder = tf.keras.layers.Dense(3, activation='relu')
def call(self, inp):
return self.decoder(self.encoder(inp))
net = model()
# create dummy input/output
inp = np.random.randn(1,1)
gt = np.random.randn(3,1)
# set persistent to true since we will be accessing the gradient 2 times
with tf.GradientTape(persistent=True) as tape:
out = custom_model(inp)
loss = tf.keras.losses.mean_squared_error(gt, out)
# get the gradients as mentioned in the question
enc_grad, dec_grad = tape.gradient(loss,
[net.encoder.trainable_variables,
net.decoder.trainable_variables])
gradients = tape.gradient(loss,
net.encoder.trainable_variables + net.decoder.trainable_variables)
First, let's use a stateless optimizer like SGD which updates the weights based on the following formula and compare it to the 2 approaches mentioned in the question.
new_weights = weights - learning_rate * gradients.
# Block 1
myoptimizer = tf.keras.optimizers.SGD(learning_rate=1)
# store weights before updating the weights based on the gradients
old_enc_weights = deepcopy(net.encoder.get_weights())
old_dec_weights = deepcopy(net.decoder.get_weights())
myoptimizer.apply_gradients(zip(enc_grad, net.encoder.trainable_variables))
myoptimizer.apply_gradients(zip(dec_grad, net.decoder.trainable_variables))
# manually calculate the weights after gradient update
# since the learning rate is 1, new_weights = weights - grad
cal_enc_weights = []
for weights, grad in zip(old_enc_weights, enc_grad):
cal_enc_weights.append(weights-grad)
cal_dec_weights = []
for weights, grad in zip(old_dec_weights, dec_grad):
cal_dec_weights.append(weights-grad)
for weights, man_calc_weight in zip(net.encoder.get_weights(), cal_enc_weights):
print(np.linalg.norm(weights-man_calc_weight))
for weights, man_calc_weight in zip(net.decoder.get_weights(), cal_dec_weights):
print(np.linalg.norm(weights-man_calc_weight))
# block 2
old_weights = deepcopy(net.encoder.trainable_variables + net.decoder.trainable_variables)
myoptimizer.apply_gradients(zip(gradients, net.encoder.trainable_variables + \
net.decoder.trainable_variables))
cal_weights = []
for weight, grad in zip(old_weights, gradients):
cal_weights.append(weight-grad)
for weight, man_calc_weight in zip(net.encoder.trainable_variables + net.decoder.trainable_variables, cal_weights):
print(np.linalg.norm(weight-man_calc_weight))
You will see that both the methods update the weights in the exact same way.
I think you used an optimizer like Adam/RMSProp which is stateful. For such optimizers invoking apply_gradients will update the optimizer parameters based on the gradient value and sign. In the first case, the optimizer parameters are updated twice and in the second case only once.
I would stick to the second option if I were you, since you are performing just one step of optimization here.

Resnet based (Tensorflow Keras) Siamese Model providing `nan` validation loss in training when using TripletHardLoss (Semi too)

I have a model which I built on top of ResNet. I am using 25k Similar type of Images. My images have text as well as some diagram. When I used the Euclidean Distance + Binary loss, I got an accuracy of 95% with Inception but same with Triplet Hard/ Semi Hard Loss gave me nan loss and almost 0 accuracy. Please tell me if there is something wrong with the code structure.
import tensorflow_addons as tfa
from tensorflow.keras.applications.resnet50 import preprocess_input as res50_pre, ResNet50
shape = (224,224,3)
lr = 0.001
loss = tfa.losses.TripletSemiHardLoss()
epochs = 50
batch_size = 128 #254 gives 'log' referenced before assignment error
datagen = ImageDataGenerator(preprocessing_function=res50_pre,validation_split=0.2)
train_data = datagen.flow_from_dataframe(df,x_col='path',y_col='label',class_mode='sparse',target_size=(224,224),
batch_size=batch_size,subset='training',seed=SEED)
val_data = datagen.flow_from_dataframe(df,x_col='path',y_col='label',class_mode='sparse',target_size=(224,224),
batch_size=batch_size,subset='validation',seed=SEED)
base_model = ResNet50(weights='imagenet',input_shape=shape,include_top=False,pooling='avg')
base_model.trainable = True
inputs = keras.Input(shape=shape)
x = base_model(inputs,training=True)
outputs = keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1))(x) # L2 normalize embeddings
model = keras.Model(inputs, outputs)
for layer in model.layers: # set all the parameters trainable
layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(lr),loss=loss,metrics=['accuracy'])
history = model.fit(train_data,epochs=epochs,steps_per_epoch=len(train_data)//batch_size,validation_data=val_data,verbose=2)
My group has values like 1,2,3 [Not in order and some missing] which represent the same type of data. I used Sparse after converting the value to str(1), str(3) etc.
My DataFrame looks like this:
increase batch size to reduce probability of a mini batch not including any triplets.
Edit: I published a package for generating TF/Keras balanced batches to solve this problem https://github.com/ma7555/kerasgen

Linear Regression model On tensorflow can't learn bias

I am trying to train a linear regression model in Tensorflow using some generated data. The model seems to learn the slope of the line, but is unable to learn the bias.
I have tried changing the no. of epochs, the weight(slope) and the biases, but every time , the learnt bias by the model comes out to be zero. I don't know where I am going wrong and some help would be appreciated.
Here is the code.
import numpy as np
import tensorflow as tf
# assume the linear model to be Y = W*X + b
X = tf.placeholder(tf.float32, [None, 1])
Y = tf.placeholder(tf.float32, [None,1])
# the weight and biases
W = tf.Variable(tf.zeros([1,1]))
b = tf.Variable(tf.zeros([1]))
# the model
prediction = tf.matmul(X,W) + b
# the cost function
cost = tf.reduce_mean(tf.square(Y - prediction))
# Use gradient descent
learning_rate = 0.000001
train_step =
tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
steps = 1000
epochs = 10
Verbose = False
# In the end, the model should learn these values
test_w = 3
bias = 10
for _ in xrange(epochs):
for i in xrange(steps):
# make fake data for the model
# feed one example at a time
# stochastic gradient descent, because we only use one example at a time
x_temp = np.array([[i]])
y_temp = np.array([[test_w*i + bias]])
# train the model using the data
feed_dict = {X: x_temp, Y:y_temp}
sess.run(train_step,feed_dict=feed_dict)
if Verbose and i%100 == 0:
print("Iteration No: %d" %i)
print("W = %f" % sess.run(W))
print("b = %f" % sess.run(b))
print("Finally:")
print("W = %f" % sess.run(W))
print("b = %f" % sess.run(b))
# These values should be close to the values we used to generate data
https://github.com/HarshdeepGupta/tensorflow_notebooks/blob/master/Linear%20Regression.ipynb
Outputs are in the last line of code.
The model needs to learn test_w and bias (in the notebook link, it is in the 3rd cell, after the first comment), which are set to 3 and 10 respectively.
The model correctly learns the weight(slope), but is unable to learn the bias. Where is the error?
The main problem is that you are feeding just one sample at a time to the model. This makes your optimizer very inestable, that's why you have to use such a small learning rate. I will suggest you to feed more samples in each step.
If you insist in feeding one sample at a time, maybe you should consider using an optimizer with momentum, like tf.train.AdamOptimizer(learning_rate). This way you can increase the learning rate and reach convergence.

Using Tensorflow's Connectionist Temporal Classification (CTC) implementation

I'm trying to use the Tensorflow's CTC implementation under contrib package (tf.contrib.ctc.ctc_loss) without success.
First of all, anyone know where can I read a good step-by-step tutorial? Tensorflow's documentation is very poor on this topic.
Do I have to provide to ctc_loss the labels with the blank label interleaved or not?
I could not be able to overfit my network even using a train dataset of length 1 over 200 epochs. :(
How can I calculate the label error rate using tf.edit_distance?
Here is my code:
with graph.as_default():
max_length = X_train.shape[1]
frame_size = X_train.shape[2]
max_target_length = y_train.shape[1]
# Batch size x time steps x data width
data = tf.placeholder(tf.float32, [None, max_length, frame_size])
data_length = tf.placeholder(tf.int32, [None])
# Batch size x max_target_length
target_dense = tf.placeholder(tf.int32, [None, max_target_length])
target_length = tf.placeholder(tf.int32, [None])
# Generating sparse tensor representation of target
target = ctc_label_dense_to_sparse(target_dense, target_length)
# Applying LSTM, returning output for each timestep (y_rnn1,
# [batch_size, max_time, cell.output_size]) and the final state of shape
# [batch_size, cell.state_size]
y_rnn1, h_rnn1 = tf.nn.dynamic_rnn(
tf.nn.rnn_cell.LSTMCell(num_hidden, state_is_tuple=True, num_proj=num_classes), # num_proj=num_classes
data,
dtype=tf.float32,
sequence_length=data_length,
)
# For sequence labelling, we want a prediction for each timestamp.
# However, we share the weights for the softmax layer across all timesteps.
# How do we do that? By flattening the first two dimensions of the output tensor.
# This way time steps look the same as examples in the batch to the weight matrix.
# Afterwards, we reshape back to the desired shape
# Reshaping
logits = tf.transpose(y_rnn1, perm=(1, 0, 2))
# Get the loss by calculating ctc_loss
# Also calculates
# the gradient. This class performs the softmax operation for you, so inputs
# should be e.g. linear projections of outputs by an LSTM.
loss = tf.reduce_mean(tf.contrib.ctc.ctc_loss(logits, target, data_length))
# Define our optimizer with learning rate
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)
# Decoding using beam search
decoded, log_probabilities = tf.contrib.ctc.ctc_beam_search_decoder(logits, data_length, beam_width=10, top_paths=1)
Thanks!
Update (06/29/2016)
Thank you, #jihyeon-seo! So, we have at input of RNN something like [num_batch, max_time_step, num_features]. We use the dynamic_rnn to perform the recurrent calculations given the input, outputting a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need to do an affine projection in each tilmestep with weight sharing, so we've to reshape to [num_batch*max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], sum a bias undo the reshape, transpose (so we will have [max_time_steps, num_batch, num_classes] for ctc loss input), and this result will be the input of ctc_loss function. Did I do everything correct?
This is the code:
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)
# Reshaping to share weights accross timesteps
x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])
self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1
# Reshaping
self._logits = tf.reshape(self._logits, [max_length, -1, num_classes])
# Calculating loss
loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)
self.cost = tf.reduce_mean(loss)
Update (07/11/2016)
Thank you #Xiv. Here is the code after the bug fix:
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)
# Reshaping to share weights accross timesteps
x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])
self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1
# Reshaping
self._logits = tf.reshape(self._logits, [-1, max_length, num_classes])
self._logits = tf.transpose(self._logits, (1,0,2))
# Calculating loss
loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)
self.cost = tf.reduce_mean(loss)
Update (07/25/16)
I published on GitHub part of my code, working with one utterance. Feel free to use! :)
I'm trying to do the same thing.
Here's what I found you may be interested in.
It was really hard to find the tutorial for CTC, but this example was helpful.
And for the blank label, CTC layer assumes that the blank index is num_classes - 1, so you need to provide an additional class for the blank label.
Also, CTC network performs softmax layer. In your code, RNN layer is connected to CTC loss layer. Output of RNN layer is internally activated, so you need to add one more hidden layer (it could be output layer) without activation function, then add CTC loss layer.
See here for an example with bidirectional LSTM, CTC, and edit distance implementations, training a phoneme recognition model on the TIMIT corpus. If you train on that corpus's training set, you should be able to get phoneme error rates down to 20-25% after 120 epochs or so.