Sentiment Analysis with word embeddings using Keras Embedding Layer - tensorflow

I need a little bit of clarification regarding my model results.
Here is my use case:
Deciding whether a review of a company from the S&P 500 is negative or positive. I used a dataset crawled from Indeed. (The dataset was labelled (0 - positive, 1 - negative), tokenized and cleaned.)
Here is some important information needed to understand the model and my approach:
# Constants
NB_WORDS = 44000 # Parameter indicating the number of words we'll put in the dictionary
VAL_SIZE = 1000 # Size of the validation set
NB_START_EPOCHS = 10 # Number of epochs we usually start to train with
EPOCH_ITER = list(range(0,11)) # For stepwise evaluating the accuracy metrics for 10 epochs
BATCH_SIZE = 512 # Size of the batches used in the mini-batch gradient descent
MAX_LEN = 267 # Maximum number of words in a sequence (review)
REV_DIM = 300 # Number of dimensions of the Indeed review word embeddings --> the most common choice (Mikolov et al., 2013)
from tensorflow.keras import models, layers

# Modeling
emb_model = models.Sequential()
emb_model.add(layers.Embedding(NB_WORDS, REV_DIM, input_length=MAX_LEN))
# Embedding layer is first hidden layer
"""
Embedding Layer (
input_length = no. of words in vocabularly;
output_dim = dimensionality;
max_length = length of largest review
)
"""
emb_model.add(layers.Flatten())
# The Flatten layer reshapes the (MAX_LEN, REV_DIM) output into a 1-D array per review
emb_model.add(layers.Dense(2, activation='softmax'))
# Dense is the regular, fully connected neural network layer. It is the most commonly
# used layer. A Dense layer performs the following operation on its input and returns the output.
# Operation := output = activation(dot(input, kernel) + bias)
# further see: https://www.tutorialspoint.com/keras/keras_dense_layer.htm#:~:text=Advertisements,input%20and%20return%20the%20output.
# Defines the output size, in our case 2, i.e. positive or negative (0 or 1)
emb_model.summary()
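To make the dot(input, kernel) + bias operation in the comments above concrete, here is a minimal numpy sketch of what a Dense layer with softmax computes (the input and weight values are made up purely for illustration, they are not taken from the model):
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([0.2, -1.0, 0.5])              # a toy flattened input with 3 features
kernel = np.random.randn(3, 2)              # weight matrix with shape (input_dim, units)
bias = np.zeros(2)
output = softmax(np.dot(x, kernel) + bias)  # output = activation(dot(input, kernel) + bias)
print(output)                               # two probabilities (positive vs. negative) that sum to 1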
I have already done some interpretation, but since I'm a beginner I really need further information/interpretation/tips, especially on how and why to improve my model.
Here are my results:

Related

Number of nodes in output layer greater than number of classes in a neural network

While training a neural network on the fashion MNIST dataset, I decided to have a greater number of nodes in my output layer than the number of classes in the dataset.
The dataset has 10 classes, while I trained my network to have 15 nodes in the output layer. I also used a softmax.
Now, surprisingly, this gave me an accuracy of 97%, which is quite good.
This leads me to the question, what do those extra 5 nodes even mean, and what do they do here?
Why is my softmax able to work properly when the label range(0-9) isn't equal to the number of nodes(15)?
And finally, in general, what does it mean to have more nodes in your output layer than the number of classes, in a classification task?
I understand the effects of having fewer nodes than the number of classes, and also that the rule of thumb is to use number of nodes = number of classes. Yet I've never seen anyone use a greater number of nodes, and I'd like to understand why/why not.
I'm attaching some code so that the results can be reproduced. This was done using TensorFlow 2.3.
import tensorflow as tf
print(tf.__version__)
mnist = tf.keras.datasets.mnist
(training_images, training_labels) , (test_images, test_labels) = mnist.load_data()
training_images = training_images/255.0
test_images = test_images/255.0
model = tf.keras.models.Sequential([tf.keras.layers.Flatten(),
                                    tf.keras.layers.Dense(256, activation=tf.nn.relu),
                                    tf.keras.layers.Dense(15, activation=tf.nn.softmax)])
model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])
model.fit(training_images, training_labels, epochs=5)
model.evaluate(test_images, test_labels)
The only reason you are able to use such a configuration is because you have specified your loss function as sparse_categorical_crossentropy.
First, let's understand the effect of extra output nodes on forward propagation.
Consider a neural network with 2 layers.
1st layer - 6 neurons (Hidden layer)
2nd layer - 4 neurons (output layer)
You have a dataset X whose shape is (100, 12), i.e. 12 features and 100 rows.
You have labels y whose shape is (100,), containing two unique values, 0 and 1.
Essentially this is a binary classification problem, but we will use 4 neurons in our output layer.
Consider each neuron as a logistic regression unit. Therefore each of your neurons will have 12 weights (w1, w2, ..., w12).
Why? - Because you have 12 features.
Each neuron will output a single term given by a. I will give the computation of a in two steps.
z = w1*x1 + w2*x2 + ... + w12*x12 + w0 # w0 is the bias
a = activation(z)
Therefore, your 1st layer will output 6 values for each row in our dataset.
So now you have a feature matrix of shape (100, 6).
This is passed to the 2nd layer and the same process repeats.
So in essence you are able to complete the forward propagation step even when you have more neurons than the actual classes.
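A minimal numpy sketch of that forward pass, with random weights purely to illustrate the shapes (the variable names are just for this sketch):
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

X = np.random.randn(100, 12)                   # 100 rows, 12 features
W1, b1 = np.random.randn(12, 6), np.zeros(6)   # hidden layer: 6 neurons
W2, b2 = np.random.randn(6, 4), np.zeros(4)    # output layer: 4 neurons
A1 = np.maximum(0, X @ W1 + b1)                # shape (100, 6) -- relu
A2 = softmax(A1 @ W2 + b2)                     # shape (100, 4) -- 4 probabilities per row, even with only 2 classes
print(A1.shape, A2.shape)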
Now let's see backpropagation.
For backpropagation to work, you must be able to calculate the loss value.
We will take a small example:
y_true has two labels as in our problem and y_pred has 4 probability values since we have 4 units in our final layer.
y_true = [0, 1]
y_pred = [[0.03, 0.90, 0.02, 0.05], [0.15, 0.02, 0.8, 0.03]]
# Using 'auto'/'sum_over_batch_size' reduction type.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy() # 3.7092905
How is it calculated:
-( log(0.03) + log(0.02) ) / 2
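You can verify this by hand with a quick numpy check:
import numpy as np
# The loss only uses the probability assigned to the true class of each row:
# row 0 has true class 0 -> 0.03, row 1 has true class 1 -> 0.02
print(-(np.log(0.03) + np.log(0.02)) / 2)  # 3.7092905...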
So, since we can compute the loss, we can also compute its gradients.
Therefore no problem in using backpropagation too.
Therefore our model can train perfectly well and achieve 90% accuracy.
So, the final question: what do these extra neurons represent (i.e. neurons 2 and 3)?
Ans - They represent the probability of the example being of class 2 and class 3 respectively. But since the labels contain no examples of class 2 or class 3, they contribute nothing to the loss value.
Note - If you encode your labels with one-hot encoding and use categorical_crossentropy as your loss, you will encounter an error, because the one-hot labels have 2 columns while the model outputs 4.
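A minimal sketch of that failure, reusing the toy predictions from above:
import tensorflow as tf
y_true_onehot = [[1., 0.], [0., 1.]]                           # shape (2, 2)
y_pred = [[0.03, 0.90, 0.02, 0.05], [0.15, 0.02, 0.8, 0.03]]   # shape (2, 4)
cce = tf.keras.losses.CategoricalCrossentropy()
cce(y_true_onehot, y_pred)  # raises an error: the last dimensions (2 vs. 4) are incompatible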

How to setup a neural network architecture for binary classification

I am reading through the TensorFlow tutorials on neural networks and I came across the architecture part, which is a bit confusing. Can someone explain why the following settings were used in this code?
import tensorflow as tf
from tensorflow import keras

# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.summary()
Vocab_size?
The value of 16 for the Embedding?
And the choice of units: I get the intuition behind the last dense layer because it is binary classification (1 unit), but why 16 units in the second layer?
Are the 16 in the embedding and the 16 units in the first dense layer related? Should they be equal?
Could someone explain this paragraph too:
The first layer is an Embedding layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).
source:
Classify movie reviews: binary classification
vocab_size: All words in your corpus (in this case IMDB) are sorted by frequency, and the top 10,000 words are kept. The rest of the vocabulary is ignored. E.g.: "This is really Fancyyyyyyy" is converted into ==> [8 7 9]. As you may guess, the word Fancyyyyyyy is ignored because it's not among the top 10,000 words.
pad_sequences: Converts all sentences to the same length. For example, in the training corpus the document lengths differ, so all of them are padded/truncated to seq_len = 256. After this step, your output is [Batch_size * seq_len].
Embedding: Each word is converted to a vector with 16 dimensions. As a result, the output of this step is a tensor of size [Batch_size * seq_len * embedding_dim].
GlobalAveragePooling1D: Converts your sequence of size [Batch_size * seq_len * embedding_dim] into [Batch_size * embedding_dim] by averaging over the sequence dimension.
unit: the output size of the dense (MLP) layer. It converts [Batch_size * embedding_dim] into [Batch_size * unit].
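A minimal sketch of those shape transitions, using the tutorial's sizes and a random dummy batch purely for illustration:
import numpy as np
from tensorflow import keras

vocab_size, seq_len = 10000, 256
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 16),       # (batch, seq_len) -> (batch, seq_len, 16)
    keras.layers.GlobalAveragePooling1D(),        # -> (batch, 16)
    keras.layers.Dense(16, activation='relu'),    # -> (batch, 16)
    keras.layers.Dense(1, activation='sigmoid')   # -> (batch, 1)
])
dummy_batch = np.random.randint(0, vocab_size, size=(4, seq_len))  # 4 padded reviews of word indices
print(model(dummy_batch).shape)  # (4, 1)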
The first layer is vocab_size because each word is represented as an index into the vocabulary. For example, if the input word is 'word', which is the 500th word in the vocabulary, the input is a vector of length vocab_size with all zeros except a one at index 500. This is commonly referred to as a 'one hot' representation.
The embedding layer essentially takes this huge input vector and condenses it into a smaller vector (in this case, length 16) that encodes some of the information about the word. The specific embedding weights are learned during training just like any other neural network layer. I'd recommend reading up on word embeddings. The length of 16 is a bit arbitrary here but can be tuned. One could do away with this embedding layer, but then the model would have less expressive power (it would just be logistic regression, which is a linear model).
Then, as you said, the last layer is simply predicting the class of the word based on the embedding.
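To make the "one-hot vector times weight matrix" picture concrete, here is a tiny numpy sketch (the weights are random, purely illustrative):
import numpy as np

vocab_size, emb_dim = 10000, 16
W = np.random.randn(vocab_size, emb_dim)  # the embedding weight matrix that would be learned during training
word_index = 500
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0
# Multiplying the one-hot vector by W just selects row 500, which is exactly
# what the Embedding layer does internally (a table lookup, not an actual matmul).
assert np.allclose(one_hot @ W, W[word_index])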

Strange sequence classification performance after shuffling sequence elements

I have one million sequences I'm trying to classify as either 0 or 1. The outcome is fairly well balanced (class 0: 70%, class 1: 30%). The maximum sequence length is 50, and I've post-padded my sequences with zeroes. There are 100 unique sequence symbols. The embedding length is 30. It's an LSTM NN trained on two outputs (one is the main output node, and the other is right after the LSTM). The code is below.
As a sanity check, I ran three versions of this: one in which I randomize the outcome labels (I expect terrible performance), another in which I randomize the order of events within each sequence but keep the outcome labels correct (I also expected bad performance), and finally one where everything is left unshuffled (I expected good performance).
Instead I found the following:
Shuffled labels: Accuracy = 69.5% (Model predicts every sequence is class 0)
Shuffled sequence symbols: Accuracy = 88%!
Nothing is shuffled: Accuracy = 90%
What do you make of this? All I can think of is that there is little signal to be gained from analyzing the sequences, and maybe most of the signal comes from the presence or absence of symbols in the sequence. Maybe RNNs and LSTMs are overkill here?
import numpy as np
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

# Input 1: event type sequences
# Take the event integer sequences, run them through an embedding layer to get float vectors, then run through LSTM
main_input = Input(shape=(max_seq_length,), dtype='int32', name='main_input')
x = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(main_input)
lstm_out = LSTM(32)(x)
# Auxiliary loss here from first input
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)
# An arbitrary number of dense, hidden layers here
x = Dense(64, activation='relu')(lstm_out)
# The main output node
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
## Compile and fit the model
model = Model(inputs=[main_input], outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'], loss_weights=[1., 0.2])
print(model.summary())
np.random.seed(21)
model.fit([train_X1], [train_Y, train_Y], epochs=1, batch_size=200)
Assuming you've played around with the size of the LSTM, your conclusion seems reasonable. Beyond that, it's hard to say, as it depends on what the dataset is. For example, it could be that shorter sequences are more unpredictable, and if most of your sequences are short, then this would support the conclusion as well.
It's also worth trying to truncate your sequences, say to the first 25 entries.
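A minimal sketch of that truncation, assuming train_X1 is the post-padded 2-D integer array from the question (the model would then need to be rebuilt with max_seq_length = 25 before fitting):
max_seq_length = 25
train_X1_trunc = train_X1[:, :max_seq_length]  # keep only the first 25 entries of each post-padded sequence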

How to stack LSTM on top of CNN in Keras?

I made the following neural network model for sound recognition purposes. The flowchart is like the following:
cnn-lstm-dense-hybrid (please click here)
The idea is the following:
I have 2 different input layers, called A and B.
(i) Input A has 100 time steps, each step has a 64-dimensional feature vector
(ii) A 1D CNN layer (TimeDistributed) will extract features from each time step. The CNN layer contains 64 filters, each 16 taps long. Then a max-pooling layer will extract the single maximum value of each convolutional output, so a total of 64 features will be extracted at each time step.
(iii) The output of the CNN layer will be fed into an LSTM layer with 64 neurons. The number of recurrent steps is the same as the number of time steps in the input, which is 100. The LSTM layer should return a sequence of 64-dimensional outputs (the length of the sequence == number of time steps == 100, so there should be 100*64 = 6400 numbers).
(iv) Meanwhile, input B also has 100 time steps, each with a 65-dimensional feature vector, but they are treated differently from input A.
(v) Input B is fed into a dense layer (TimeDistributed) of 65 neurons, so it should produce a 65-dimensional output at each time step.
Now, at each time step, we have the output of the LSTM layer (64 neurons) and of the dense layer (65 neurons); we concatenate them in a merge layer, giving a 129-dimensional vector at each time step.
We feed this vector into another dense layer, which produces the output (single neuron, which represents the probability of "is target sound")
A hand drawn illustration
However, I am stuck at the very beginning trying to make 1(i) work. The code of network building is below:
from keras.layers import Input, Dense, Dropout, LSTM, Conv1D, MaxPooling1D, BatchNormalization, TimeDistributed, concatenate
from keras.models import Model

mfcc_input = Input(shape=(100,64), dtype='float', name='mfcc_input')
print(mfcc_input)
CNN_out = TimeDistributed(Conv1D(64, 16, activation='relu'))(mfcc_input)
CNN_out = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True)(CNN_out)
CNN_out = TimeDistributed(MaxPooling1D(pool_size=(64-16+1), strides=None, padding='valid'))(CNN_out)
CNN_out = Dropout(0.4)(CNN_out)
LSTM_out = LSTM(64,return_sequences=True)(CNN_out)
## Auxilliary branch
delta_input = Input(shape=(100,64), dtype='float', name='delta_input')
zcr_input = Input(shape=(100,1), dtype='float', name='zcr_input')
aux_input = concatenate([delta_input, zcr_input])
aux_out = TimeDistributed(Dense(64+1))(aux_input)
### Merge branches
merged_layer = concatenate([LSTM_out, aux_out])
## Output layer
output = TimeDistributed(Dense(1))(merged_layer)
model = Model(inputs=[mfcc_input, delta_input, zcr_input], outputs=[output])
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
loss_weights=[1., 0.2])
...(other code here) ...
The error at "CNN_out = TimeDistributed(Conv1D(64, 16, activation='relu'))(mfcc_input)" is: IndexError: list index out of range
Could anyone help? Greatly appreciated!

Custom dropout in tensorflow

I'm training a DNN model on some data, and am hoping to analyze the learned weights to learn something about the true system I am studying (signaling cascades in biology). I guess one could say I am using Artificial NNs to learn about Biological NNs.
For each of my training examples, I have removed a single gene that is responsible for signaling at the top layer.
As I am modeling this signaling cascade as a NN, and removing one of the nodes in the first hidden layer, I realized that I'm doing a real life version of dropout.
I would therefore like to use dropout to train my model; however, the implementations of dropout that I have seen online seem to drop out a node at random. What I need is a way to specify which node to drop out for each training example.
Any advice on how to implement this? I'm open to any package, but right now everything I have already done is in TensorFlow, so I'd appreciate a solution that uses that framework.
For those that prefer the details explained:
I have 10 input variables, that are fully connected to 32 relu nodes in the first layer, which are fully connected to a second layer (relu), which is fully connected to the output (linear because I am doing regression).
In addition to the 10 input variables, I also happen to know which of the 28 nodes should be dropped out.
Is there a way I can specify this when training?
Here is the code I currently use:
import tflearn

num_stresses = 10
num_kinase = 32
num_transcription_factors = 200
num_genes = 6692
# Build neural network
# Input variables (10)
# Which Node to dropout (32)
stress = tflearn.input_data(shape=[None, num_stresses])
kinase_deletion = tflearn.input_data(shape=[None, num_kinase])
# This is the layer that I want to perform selective dropout on,
# I should be able to specify which of the 32 nodes should output zero
# based on a 1X32 vector of ones and zeros.
kinase = tflearn.fully_connected(stress, num_kinase, activation='relu')
transcription_factor = tflearn.fully_connected(kinase, num_transcription_factors, activation='relu')
gene = tflearn.fully_connected(transcription_factor, num_genes, activation='linear')
adam = tflearn.Adam(learning_rate=0.00001, beta1=0.99)
regression = tflearn.regression(gene, optimizer=adam, loss='mean_square', metric='R2')
# Define model
model = tflearn.DNN(regression, tensorboard_verbose=1)
I would supply your input variables along with an equal-sized vector of all 1's, except for the node you want to drop, which gets a 0.
The very first operation should then be a multiplication to zero out the gene you want to drop. From there on out, it should be exactly the same as what you have now.
You can either multiply (zero out your gene) before handing it to TensorFlow, or add another placeholder and feed it into the graph via the feed_dict like you do with your variables. The latter would probably be better.
If you need to drop a hidden node (in layer 2), it's just another vector of 1s and a 0.
Let me know if that works or if you need more help.
Edit:
OK, so I haven't really worked with tflearn very much (I've just used regular TensorFlow), but I think you can combine TensorFlow and tflearn. Basically, I added tf.multiply. You might have to add another tflearn.input_data(shape=[num_stresses]) and tflearn.input_data(shape=[num_kinase]) to give you placeholders for stresses_dropout_vector and kinase_dropout_vector. And of course, you can change the number and positions of the zeros in those two vectors.
import tensorflow as tf ###### New ######
import tflearn
num_stresses = 10
num_kinase = 32
num_transcription_factors = 200
num_genes = 6692
stresses_dropout_vector = [1] * num_stresses ###### NEW ######
stresses_dropout_vector[desired_node_to_drop] = 0 ###### NEW ######
kinase_dropout_vector = [1] * num_kinase ###### NEW ######
kinase_dropout_vector[desired_hidden_node_to_drop] = 0 ###### NEW ######
# Build neural network
# Input variables (10)
# Which Node to dropout (32)
stress = tflearn.input_data(shape=[None, num_stresses])
kinase_deletion = tflearn.input_data(shape=[None, num_kinase])
# This is the layer that I want to perform selective dropout on,
# I should be able to specify which of the 32 nodes should output zero
# based on a 1X32 vector of ones and zeros.
stress_dropout = tf.multiply(stress, stresses_dropout_vector) ###### NEW ###### Drops out an input
kinase = tflearn.fully_connected(stress_dropout, num_kinase, activation='relu') ### changed stress to stress_dropout
kinase_dropout = tf.multiply(kinase, kinase_dropout_vector) ###### NEW ###### Drops out a hidden node
transcription_factor = tflearn.fully_connected(kinase_dropout, num_transcription_factors, activation='relu') ### changed kinase to kinase_dropout
gene = tflearn.fully_connected(transcription_factor, num_genes, activation='linear')
adam = tflearn.Adam(learning_rate=0.00001, beta1=0.99)
regression = tflearn.regression(gene, optimizer=adam, loss='mean_square', metric='R2')
# Define model
model = tflearn.DNN(regression, tensorboard_verbose=1)
If mixing in TensorFlow doesn't work, you just have to find the tflearn equivalent of tf.multiply, i.e. a function that does an element-wise multiplication of two given tensors/vectors.
Hope that helps.
For completeness, here is my final implementation:
import numpy as np
import pandas as pd
import tflearn
import tensorflow as tf
meta = pd.read_csv('../../input/nn/meta.csv')
experiments = meta["Unnamed: 0"]
del meta["Unnamed: 0"]
stress_one_hot = pd.get_dummies(meta["train"])
kinase_deletion = pd.get_dummies(meta["Strain"])
kinase_one_hot = 1 - kinase_deletion
expression = pd.read_csv('../../input/nn/data.csv')
genes = expression["Unnamed: 0"]
del expression["Unnamed: 0"] # This holds the gene names just so you know...
expression = expression.transpose()
# Set up data for tensorflow
# Gene expression
target = expression
target = np.array(expression, dtype='float32')
target_mean = target.mean(axis=0, keepdims=True)
target_std = target.std(axis=0, keepdims=True)
target = target - target_mean
target = target / target_std
# Stress information
data1 = stress_one_hot
data1 = np.array(data1, dtype='float32')
data_mean = data1.mean(axis=0, keepdims=True)
data_std = data1.std(axis=0, keepdims=True)
data1 = data1 - data_mean
data1 = data1 / data_std
# Kinase information
data2 = kinase_one_hot
data2 = np.array(data2, dtype='float32')
# For Reference
# data1.shape
# #(301, 10)
# data2.shape
# #(301, 29)
# Build the Neural Network
num_stresses = 10
num_kinase = 29
num_transcription_factors = 200
num_genes = 6692
# Build neural network
# Input variables (10)
# Which Node to dropout (32)
stress = tflearn.input_data(shape=[None, num_stresses])
kinase_deletion = tflearn.input_data(shape=[None, num_kinase])
# This is the layer that I want to perform selective dropout on,
# I should be able to specify which of the 32 nodes should output zero
# based on a 1X32 vector of ones and zeros.
kinase = tflearn.fully_connected(stress, num_kinase, activation='relu')
kinase_dropout = tf.multiply(kinase, kinase_deletion)  # element-wise mask to zero out the deleted kinase
transcription_factor = tflearn.fully_connected(kinase_dropout, num_transcription_factors, activation='relu')
gene = tflearn.fully_connected(transcription_factor, num_genes, activation='linear')
adam = tflearn.Adam(learning_rate=0.00001, beta1=0.99)
regression = tflearn.regression(gene, optimizer=adam, loss='mean_square', metric='R2')
# Define model
model = tflearn.DNN(regression, tensorboard_verbose=1)
# Start training (apply gradient descent algorithm)
model.fit([data1, data2], target, n_epoch=20000, show_metric=True, shuffle=True)#,validation_set=0.05)