Unable to track record by record processing in LSTM algorithm for text classification? - tensorflow

We are working on multi-class text classification, and the following is the process we have used.
1) We created 300-dimensional vectors with word2vec word embeddings using our own data and then passed those vectors as weights to the LSTM embedding layer.
2) We then used one LSTM layer and one dense layer.
Below is my code:
from keras import layers, models, optimizers

input_layer = layers.Input((train_seq_x.shape[1],))
embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)
lstm_layer1 = layers.LSTM(300, return_sequences=True, activation="relu")(embedding_layer)
lstm_layer1 = layers.Dropout(0.5)(lstm_layer1)
flat_layer = layers.Flatten()(lstm_layer1)
output_layer = layers.Dense(33, activation="sigmoid")(flat_layer)
model = models.Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
Please help me out with the questions below:
Q1) Why did we pass the word embedding vector (300 dims) as weights to the LSTM embedding layer?
Q2) How can we know the optimal number of neurons in the LSTM layer?
Q3) Can you please explain how a single record is processed in the LSTM algorithm?
Please let me know if you require more information.

Q1) Why did we pass the word embedding vector (300 dims) as weights to the LSTM embedding layer?
In a very simplistic way, you can think of an embedding layer as a lookup table which converts a word (represented by its index in a dictionary) to a vector. It is a trainable layer. Since you have already trained word embeddings, instead of initializing the embedding layer with random weights you initialize it with the vectors you have learned.
Embedding(len(word_index)+1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
So here you are:
creating an embedding layer, i.e. a lookup table which can look up word indices from 0 to len(word_index);
each looked-up word maps to a vector of size 300;
this lookup table is loaded with the vectors from "embedding_matrix" (which is a pretrained model);
trainable=False will freeze the weights in this layer.
You have passed 300 because it is the vector size of your pretrained model (embedding_matrix).
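To make the lookup-table view concrete, here is a tiny self-contained sketch (the 4-word vocabulary and the vector values are made up purely for illustration):
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Toy lookup table: a 4-word vocabulary, 3-dimensional vectors.
# Row i of emb_matrix is the vector for the word with index i.
emb_matrix = np.arange(12, dtype='float32').reshape(4, 3)
model = Sequential()
model.add(Embedding(4, 3, weights=[emb_matrix], trainable=False, input_length=2))
print(model.predict(np.array([[0, 3]])))  # prints rows 0 and 3 of emb_matrix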
Q2) How can we know the optimal number of neurons in the LSTM layer?
You have created an LSTM layer which takes a 300-dimensional vector as input and returns a vector of size 300. The output size and the number of stacked LSTMs are hyperparameters which are tuned manually (usually with K-fold CV), as sketched below.
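A rough outline of that tuning loop; this is only a sketch: build_model is a hypothetical factory returning a freshly compiled model (with an accuracy metric) for a given LSTM size, and x, y are your encoded inputs and one-hot labels:
import numpy as np
from sklearn.model_selection import KFold

# Score a few candidate LSTM sizes with 5-fold cross-validation.
for units in (64, 128, 300):
    scores = []
    for train_idx, val_idx in KFold(n_splits=5).split(x):
        m = build_model(units)  # hypothetical model factory
        m.fit(x[train_idx], y[train_idx], epochs=5, verbose=0)
        scores.append(m.evaluate(x[val_idx], y[val_idx], verbose=0)[1])  # accuracy
    print(units, np.mean(scores))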
Q3) Can you please explain how a single record is processed in the LSTM algorithm?
A single record/sentence is converted into indices of the vocabulary, so for every sentence you have an array of indices.
A batch of these sentences is created and fed as input to the model.
The LSTM is unrolled by passing in one index at a time as input at each timestep.
Finally, the output of the LSTM is forward-propagated through a final dense layer to size 33. So it looks like each input is mapped to one of 33 classes in your case.
Simple example
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, SpatialDropout1D
from keras.utils import to_categorical

training_data = ["it was a good movie".split(), "it was a bad movie".split()]
training_target = [1, 0]

# word -> index vocabulary; index 0 is reserved for padding
vocab = {w: i + 1 for i, w in enumerate(sorted({w for s in training_data for w in s}))}

model = Sequential()
model.add(Embedding(len(vocab) + 1, 50, input_length=5))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(10, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

x = np.array([[vocab[w] for w in s] for s in training_data])  # sentences -> index arrays
y = to_categorical(training_target)
model.fit(x, y)

Related

How to create a Keras model with switchable input layers?

Current simple model:
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Input
from tensorflow.keras.optimizers import Adam
def model():
    input_A = Input(shape=(6,))
    out = Dense(64, activation="relu")(input_A)
    out = Dense(32, activation="relu")(out)
    outputs = Dense(1, activation="tanh")(out)
    model = Model(
        inputs=input_A,
        outputs=outputs,
        name="switchable_inputs_model")
    model.compile(loss="mse", optimizer=Adam(), metrics=["accuracy"])
    return model
I want to have another input layer, input_B, which will not be active all the time during learning. Let us say we have two input layers: input A and input B. However, at a given time, only one input layer can be active. This selection of the input layer is decided by a binary combination of information available at execution time (the learning stage). For instance, if it is 1 0, then input layer A will be used. Similarly, if it is 0 1, input layer B will be used.
How can I do this?
It's hard to guess from your question what you are trying to accomplish in detail, but you should carefully consider whether that is necessary.
It's common practice to have an input layer of a fixed size that matches the structure of your data. You preprocess your data to match that shape.
In the domain of e.g. images this might mean:
If you have images of different resolutions, you could consider cropping, padding or resizing your inputs to a fixed size, as sketched below.
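For instance, a minimal sketch with tf.image (the 64x64 target size is an arbitrary example, not something from the question):
import tensorflow as tf

# Bring variable-sized images to one fixed input shape by center-cropping
# or zero-padding, whichever each image needs.
def to_fixed_size(image, height=64, width=64):
    return tf.image.resize_with_crop_or_pad(image, height, width)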
If there is a rationale behind this please clarify.

tfidf weighted average of word embeddings with Keras

I don't know how this is possible, but I want to calculate a weighted average of the word embeddings in a sentence, e.g. weighted by tf-idf scores.
Is it exactly this, but with weights:
averaging a sentence’s word vectors in Keras- Pre-trained Word Embedding
import keras
from keras.layers import Embedding
from keras.models import Sequential
import numpy as np
# Set parameters
vocab_size=1000
max_length=10
# Generate random embedding matrix for sake of illustration
embedding_matrix = np.random.rand(vocab_size,300)
model = Sequential()
model.add(Embedding(vocab_size, 300, weights=[embedding_matrix],
                    input_length=max_length, trainable=False))
# Average the output of the Embedding layer over the word dimension
model.add(keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1)))
model.summary()
How could you get, with a custom layer or a Lambda layer, the proper weights belonging to a specific word? You would need to access the embedding layer somehow to get the index and then look up the proper weight.
Or is there a simple way I don't see?
embeddings = model.layers[0].get_weights()[0]  # embedding weights, shape (vocab_size, embedding_dim)
Alternatively, if you define the layer object:
embedding_layer = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=max_length, trainable=False)
# note: the layer must be built (e.g. by being used in a model) before get_weights() returns the matrix
embeddings = embedding_layer.get_weights()[0]
From here, you can probably directly address the individual weights by just querying their positions using your unprocessed bag of words or integer inputs.
If you want to, you can additionally go through the actual word vectors by the string words, though that shouldn't be necessary for simply accumulating all word vectors of each sentence:
# `word_to_index` is a mapping (i.e. dict) from words to their index that you need to provide (from your original input data which should be ints)
word_embeddings = {w:embeddings[idx] for w, idx in word_to_index.items()}
print(word_embeddings['chair']) # gives you the word vector
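To actually compute the tf-idf-weighted average inside the model, one option is to feed the per-token scores as a second input and combine them with the embedding output. A minimal sketch, assuming the tf-idf scores are precomputed per token (e.g. with scikit-learn) and aligned with the padded index sequences:
import numpy as np
from keras.layers import Input, Embedding, multiply, Lambda
from keras.models import Model
import keras.backend as K

vocab_size, max_length = 1000, 10
embedding_matrix = np.random.rand(vocab_size, 300)

tokens = Input(shape=(max_length,), dtype='int32')  # padded word indices
tfidf = Input(shape=(max_length, 1))                # per-token tf-idf scores
emb = Embedding(vocab_size, 300, weights=[embedding_matrix], trainable=False)(tokens)
weighted = multiply([emb, tfidf])                   # scale each word vector by its score
# weighted mean over the word dimension; epsilon guards against all-zero scores
avg = Lambda(lambda t: K.sum(t[0], axis=1) / (K.sum(t[1], axis=1) + K.epsilon()))([weighted, tfidf])
model = Model([tokens, tfidf], avg)
model.summary()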

tf.keras loss from two images in serial

I want to use the stability training approach of the paper and apply it to a very simple CNN.
The principal architecture is given by:
As shown in the figure you compute the loss based on the output f(I) for the input image I and on
the output f(I') for the perturbed image I'.
My question would be how to do this in a valid way without having two instances of the DNN, as I'm training on large 3D images. In other words: how can I process two images serially and compute the loss based on those two images?
I'm using tf2 with keras.
You can first write your DNN as a tf.keras Model.
After that, you can write another model which takes two image inputs, applies some Gaussian noise to one, and passes both through the DNN.
Then design a custom loss function which computes the proper loss from the two outputs.
Here's a demo code:
from tensorflow.keras.layers import Input, Dense, Flatten, GaussianNoise
from tensorflow.keras.models import Model
# shared DNN: this is the base model with a feature-space output; there is only one instance of the model
ip = Input(shape=(32,32,1)) # same as original inputs
f0 = Flatten()(ip)
d0 = Dense(10)(f0) # 10 dimensional feature embedding
dnn = Model(ip, d0)
# final model with two version of images and loss
input_1 = Input(shape=(32,32,1))
input_2 = Input(shape=(32,32,1))
g0 = GaussianNoise(0.5)(input_2) # only input_2 passes through gaussian noise layer, you can design your own custom layer too
# passing the two images to same DNN
path1 = dnn(input_1) # no noise
path2 = dnn(g0) # noise
model = Model([input_1, input_2], [path1, path2])
def my_loss(y_true, y_pred):
    # calculate your loss based on your two outputs path1, path2
    pass
model.compile('adam', my_loss)
model.summary()
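The my_loss stub is left empty above; as a hedged sketch of one way to fill in the objective, the stability term ||f(I) - f(I')||^2 can instead be attached via add_loss (since it needs both outputs at once), with the task loss applied to the clean output only (alpha is a hypothetical, tunable weight):
import tensorflow as tf

# stability term: penalize disagreement between the clean and noisy paths
alpha = 0.01
model.add_loss(alpha * tf.reduce_mean(tf.square(path1 - path2)))
model.compile('adam', loss=['mse', None])  # task loss on the clean output only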

Batch normalization layer for CNN-LSTM

Suppose that I have a model like this (this is a model for time series forecasting):
ipt = Input((data.shape[1] ,data.shape[2])) # 1
x = Conv1D(filters = 10, kernel_size = 3, padding = 'causal', activation = 'relu')(ipt) # 2
x = LSTM(15, return_sequences = False)(x) # 3
x = BatchNormalization()(x) # 4
out = Dense(1, activation = 'relu')(x) # 5
Now I want to add a batch normalization layer to this network. Considering the fact that batch normalization doesn't work with LSTMs, can I add it before the Conv1D layer? I think it's rational to have a batch normalization layer after the LSTM.
Also, where can I add Dropout in this network? In the same places? (after or before batch normalization?)
What about adding AveragePooling1D between Conv1D and LSTM? Is it possible to add batch normalization between Conv1D and AveragePooling1D in this case without any effect on the LSTM layer?
Update: the LayerNormalization implementation I was using was inter-layer, not recurrent as in the original paper; results with the latter may prove superior.
BatchNormalization can work with LSTMs - the linked SO gives false advice; in fact, in my application of EEG classification, it dominated LayerNormalization. Now to your case:
"Can I add it before Conv1D"? Don't - instead, standardize your data beforehand, else you're employing an inferior variant to do the same thing
Try both: BatchNormalization before an activation, and after - apply to both Conv1D and LSTM
If your model is exactly as you show it, BN after LSTM may be counterproductive, as it can introduce noise which can confuse the classifier layer - but this is about being one layer before the output, not about LSTM itself
If you aren't using stacked LSTM with return_sequences=True preceding return_sequences=False, you can place Dropout anywhere - before LSTM, after, or both
Spatial Dropout: drop units / channels instead of random activations (see bottom); it was shown more effective at reducing coadaptation in CNNs in a paper by LeCun et al., with ideas applicable to RNNs. It can considerably increase convergence time, but also improve performance
recurrent_dropout is still preferable to Dropout for LSTM - however, you can do both; just do not use it with activation='relu', for which LSTM is unstable per a bug
For data of your dimensionality, any sort of Pooling is redundant and may harm performance; scarce data is better transformed via a non-linearity than simple averaging ops
I strongly recommend a SqueezeExcite block after your Conv; it's a form of self-attention - see paper; my implementation for 1D below
I also recommend trying activation='selu' with AlphaDropout and 'lecun_normal' initialization, per paper Self Normalizing Neural Networks
Disclaimer: above advice may not apply to NLP and embed-like tasks
Below is an example template you can use as a starting point; I also recommend the following SO's for further reading: Regularizing RNNs, and Visualizing RNN gradients
from keras.layers import Input, Dense, LSTM, Conv1D, Activation
from keras.layers import AlphaDropout, BatchNormalization
from keras.layers import GlobalAveragePooling1D, Reshape, multiply
from keras.models import Model
import keras.backend as K
import numpy as np
def make_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = ConvBlock(ipt)
    x = LSTM(16, return_sequences=False, recurrent_dropout=0.2)(x)
    # x = BatchNormalization()(x)  # may or may not work well
    out = Dense(1, activation='relu')(x)
    model = Model(ipt, out)
    model.compile('nadam', 'mse')
    return model

def make_data(batch_shape):  # toy data
    return (np.random.randn(*batch_shape),
            np.random.uniform(0, 2, (batch_shape[0], 1)))

batch_shape = (32, 21, 20)
model = make_model(batch_shape)
x, y = make_data(batch_shape)
model.train_on_batch(x, y)
Functions used:
def ConvBlock(_input):  # cleaner code
    x = Conv1D(filters=10, kernel_size=3, padding='causal', use_bias=False,
               kernel_initializer='lecun_normal')(_input)
    x = BatchNormalization(scale=False)(x)
    x = Activation('selu')(x)
    x = AlphaDropout(0.1)(x)
    out = SqueezeExcite(x)
    return out

def SqueezeExcite(_input, r=4):  # r == "reduction factor"; see paper
    filters = K.int_shape(_input)[-1]
    se = GlobalAveragePooling1D()(_input)
    se = Reshape((1, filters))(se)
    se = Dense(filters//r, activation='relu', use_bias=False,
               kernel_initializer='he_normal')(se)
    se = Dense(filters, activation='sigmoid', use_bias=False,
               kernel_initializer='he_normal')(se)
    return multiply([_input, se])
Spatial Dropout: pass noise_shape = (batch_size, 1, channels) to Dropout - this drops entire channels across all timesteps; see Git gist for code.
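A minimal sketch of that call (assuming x is a (batch, timesteps, channels) tensor inside a model definition; None lets Keras fill in the batch size):
from keras.layers import Dropout

# drop whole channels for all timesteps at once, not individual activations
x = Dropout(0.3, noise_shape=(None, 1, K.int_shape(x)[-1]))(x)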

Implementing a many-to-many LSTM in TensorFlow?

I am using TensorFlow to make predictions on time-series data. So it is like I have 50 tags and I want to find out the next possible 5 tags.
As shown in the following picture, I want to make it like the 4th structure.
I went through the tutorial demo: Recurrent Neural Networks
But I found it provides something like the 5th one in the picture above, which is different.
I am wondering which model I could use. I am thinking of seq2seq models, but I'm not sure if that is the right way.
You are right that you can use a seq2seq model. For brevity I've written up an example of how you can do it in Keras which also has a Tensorflow backend. I've not run the example so it might need tweaking. If your tags are one-hot you need to use cross-entropy loss instead.
from keras.models import Model
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
# The input shape is your sequence length and your token embedding size
inputs = Input(shape=(seq_len, embedding_size))
# Build a RNN encoder
encoder = LSTM(128, return_sequences=False)(inputs)
# Repeat the encoding for every input to the decoder
encoding_repeat = RepeatVector(5)(encoder)
# Pass your (5, 128) encoding to the decoder
decoder = LSTM(128, return_sequences=True)(encoding_repeat)
# Output each timestep into a fully connected layer
sequence_prediction = TimeDistributed(Dense(1, activation='linear'))(decoder)
model = Model(inputs, sequence_prediction)
model.compile('adam', 'mse') # Or categorical_crossentropy
model.fit(X_train, y_train)