Batch normalization layer for CNN-LSTM - tensorflow

Suppose that I have a model like this (this is a model for time series forecasting):
ipt = Input((data.shape[1] ,data.shape[2])) # 1
x = Conv1D(filters = 10, kernel_size = 3, padding = 'causal', activation = 'relu')(ipt) # 2
x = LSTM(15, return_sequences = False)(x) # 3
x = BatchNormalization()(x) # 4
out = Dense(1, activation = 'relu')(x) # 5
Now I want to add a batch normalization layer to this network. Considering that batch normalization reportedly doesn't work with LSTM, can I add it before the Conv1D layer? I think it's rational to have a batch normalization layer after the LSTM.
Also, where can I add Dropout in this network? The same places? (after or before batch normalization?)
What about adding AveragePooling1D between Conv1D and LSTM? Is it possible to add batch normalization between Conv1D and AveragePooling1D in this case without any effect on LSTM layer?

Update: the LayerNormalization implementation I was using was inter-layer, not recurrent as in the original paper; results with latter may prove superior.
BatchNormalization can work with LSTMs - the linked SO gives false advice; in fact, in my application of EEG classification, it dominated LayerNormalization. Now to your case:
"Can I add it before Conv1D"? Don't - instead, standardize your data beforehand, else you're employing an inferior variant to do the same thing
Try both: BatchNormalization before an activation, and after - apply to both Conv1D and LSTM
If your model is exactly as you show it, BN after LSTM may be counterproductive, since it can introduce noise that confuses the classifier layer - but this is about being one layer before the output, not about LSTM
If you aren't using stacked LSTM with return_sequences=True preceding return_sequences=False, you can place Dropout anywhere - before LSTM, after, or both
Spatial Dropout: drop units / channels instead of random activations (see bottom); was shown more effective at reducing coadaptation in CNNs in paper by LeCun, et al, w/ ideas applicable to RNNs. Can considerably increase convergence time, but also improve performance
recurrent_dropout is still preferable to Dropout for LSTM - however, you can do both; just do not use it with activation='relu', for which LSTM is unstable per a bug
For data of your dimensionality, any sort of Pooling is redundant and may harm performance; scarce data is better transformed via a non-linearity than simple averaging ops
I strongly recommend a SqueezeExcite block after your Conv; it's a form of self-attention - see paper; my implementation for 1D below
I also recommend trying activation='selu' with AlphaDropout and 'lecun_normal' initialization, per paper Self Normalizing Neural Networks
Disclaimer: above advice may not apply to NLP and embed-like tasks
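For the first point, standardizing the data beforehand could look roughly like the sketch below. It assumes the data is a NumPy array shaped (samples, timesteps, features) and that the first 80% of the samples form the training set; the helper name standardize and the toy data are made up for illustration. The statistics are computed on the training part only, so nothing leaks in from the test part.

```python
import numpy as np

def standardize(train, test, eps=1e-8):
    """Scale each feature to zero mean / unit variance using training statistics only."""
    # statistics are taken over samples and timesteps, separately per feature
    mean = train.mean(axis=(0, 1), keepdims=True)
    std = train.std(axis=(0, 1), keepdims=True)
    return (train - mean) / (std + eps), (test - mean) / (std + eps)

# toy data standing in for the real series: (samples, timesteps, features)
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(100, 21, 20))
split = int(0.8 * len(data))  # assumed 80/20 chronological split
train, test = standardize(data[:split], data[split:])
```

The standardized arrays can then be fed to the model directly, with no normalization layer in front of Conv1D.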
Below is an example template you can use as a starting point; I also recommend the following SO's for further reading: Regularizing RNNs, and Visualizing RNN gradients
from keras.layers import Input, Dense, LSTM, Conv1D, Activation
from keras.layers import AlphaDropout, BatchNormalization
from keras.layers import GlobalAveragePooling1D, Reshape, multiply
from keras.models import Model
import keras.backend as K
import numpy as np
def make_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = ConvBlock(ipt)
    x = LSTM(16, return_sequences=False, recurrent_dropout=0.2)(x)
    # x = BatchNormalization()(x) # may or may not work well
    out = Dense(1, activation='relu')(x)  # the layer must be applied to x
    model = Model(ipt, out)
    model.compile('nadam', 'mse')
    return model
def make_data(batch_shape): # toy data
    return (np.random.randn(*batch_shape),
            np.random.uniform(0, 2, (batch_shape[0], 1)))
batch_shape = (32, 21, 20)
model = make_model(batch_shape)
x, y = make_data(batch_shape)
model.train_on_batch(x, y)
Functions used:
def ConvBlock(_input): # cleaner code
    x = Conv1D(filters=10, kernel_size=3, padding='causal', use_bias=False,
               kernel_initializer='lecun_normal')(_input)
    x = BatchNormalization(scale=False)(x)
    x = Activation('selu')(x)
    x = AlphaDropout(0.1)(x)
    out = SqueezeExcite(x)
    return out
def SqueezeExcite(_input, r=4): # r == "reduction factor"; see paper
    filters = K.int_shape(_input)[-1]
    se = GlobalAveragePooling1D()(_input)
    se = Reshape((1, filters))(se)
    se = Dense(filters//r, activation='relu', use_bias=False,
               kernel_initializer='he_normal')(se)
    se = Dense(filters, activation='sigmoid', use_bias=False,
               kernel_initializer='he_normal')(se)
    return multiply([_input, se])
Spatial Dropout: pass noise_shape = (batch_size, 1, channels) to Dropout - has the effect below; see Git gist for code:
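To make the effect of that noise_shape concrete, here is a NumPy-only illustration (not Keras's actual implementation): a mask of shape (batch_size, 1, channels) is broadcast along the timestep axis, so a channel is either dropped for every timestep of a sample or kept for every timestep. The function name spatial_dropout_1d and the toy shapes are assumptions made for this sketch.

```python
import numpy as np

def spatial_dropout_1d(x, rate, rng):
    """Drop whole channels: one keep/drop decision per (sample, channel)."""
    batch_size, _, channels = x.shape
    # the mask has a singleton timestep axis, so it broadcasts over all timesteps
    keep = rng.random((batch_size, 1, channels)) >= rate
    # inverted dropout: rescale kept values so the expected activation is unchanged
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((4, 6, 10))  # (batch_size, timesteps, channels)
dropped = spatial_dropout_1d(x, rate=0.5, rng=rng)
```

In Keras itself, the built-in SpatialDropout1D layer applies this kind of channel-wise mask without needing to know the batch size in advance.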

Related

XOR problem with 2-2-1 configuration should always predict output accurately?

I am trying to solve the XOR problem using the following code:
import numpy as np
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input, Concatenate
from tensorflow.keras.utils import plot_model
from tensorflow.keras.optimizers import SGD, Adam
# input data
x = np.array([[0,0], [0,1], [1,0], [1,1]], 'float32')
y = np.array([[0], [1], [1], [0]], 'float32')
### Model
model = Sequential()
# add layers (architecture)
model.add(Dense(2, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
# compile
model.compile(loss = 'mean_squared_error',
              optimizer = SGD(learning_rate = 0.1, momentum=0.8),
              metrics = ['accuracy'])
# train
model.fit(x, y, epochs = 25000, batch_size = 1)
# evaluate
ev = model.evaluate(x, y)
I already tested:
using different activation functions in the hidden layer (sigmoid and tanh)
using different learning rates and momentum
Also, I am running with a high number of epochs (25000). Still, it only accurately predicts all outputs a few times. Most of the time, accuracy is equal to 0.5 or 0.75.
I have read that this is the minimum configuration to solve this problem. However, it also seems that the error surface presents a number of regions with local minima.
My question is:
Should I assume that the model is correct and can learn the problem, although sometimes it gets 'stuck' in a local minimum, OR do I still need to improve my model somehow to solve the XOR more accurately and consistently?
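On whether a 2-2-1 network can represent XOR at all: one way to check is to set the weights by hand rather than train them. The weights below are one hand-picked solution for a ReLU hidden layer and a sigmoid output (an illustration, not what training would necessarily find). If such weights exist, a run that ends at 0.5 or 0.75 accuracy suggests a problem with optimization or initialization rather than with capacity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype="float32")
y = np.array([[0], [1], [1], [0]], dtype="float32")

# hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
# output: h1 - 2*h2 equals XOR exactly; scale and shift so the sigmoid saturates
W2 = np.array([[10.0], [-20.0]])
b2 = np.array([-5.0])

hidden = relu(x @ W1 + b1)
pred = sigmoid(hidden @ W2 + b2)
labels = (pred > 0.5).astype("float32")
```

Thresholding pred at 0.5 reproduces y for all four inputs, so the architecture is sufficient in principle; whether gradient descent reaches such a solution depends on the random starting point.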

LSTM training error is very high and relatively unchanging

As a learning exercise, I'm trying to use an LSTM model with the Keras framework to predict the stock market based on multiple data points. The size of my input array is roughly [5000, 100]. Based on other questions on this site and articles online, the approach seems fairly standard: put the data in a numpy array, scale it, reshape it to 3 dimensions for the LSTM, split it into train and test sections, and feed it through the model. Running only the training portion of the model, I am consistently getting loss scores around 400,000,000. This is not changed by altering the batch size, the number of epochs, the number of layers, replacing the normalization with dropout layers, changing the sizes of each layer, or using different optimizers and loss functions. Any idea why the loss is so high and what I can do to fix that? Attached is the code. All advice is greatly appreciated.
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, losses, optimizers, Model, preprocessing
from keras.utils import plot_model
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
scaler = MinMaxScaler(feature_range=(0, 1))
features_df = pd.read_csv("dataset.csv")
features_np = np.array(features_df)
features_np.astype(np.float64)
scaler.fit_transform(features_np)
num_features=features_np.shape[1]
features = np.reshape(features_np, (features_np.shape[0], 1, features_np.shape[1]))
labels_np = np.array(pd.read_csv("output.csv"))
scaler.fit_transform(labels_np)
test_in = features_np[int(features_np.shape[0] * 0.75):]
test_in = np.reshape(test_in, (test_in.shape[0], 1, test_in.shape[1]))
test_out = labels_np[int(labels_np.shape[0] * 0.75):]
test_out = np.reshape(test_out, (test_out.shape[0], 1, test_out.shape[1]))
inputs = layers.Input(shape=(1, features.shape[2]))
x = layers.LSTM(5000, return_sequences=True)(inputs)
lstm1 = layers.LSTM(1000, return_sequences=True)(x)
norm1 = layers.BatchNormalization()(lstm1)
lstm2 = layers.LSTM(1000, return_sequences=True)(norm1)
lstm3 = layers.LSTM(1000, return_sequences=True)(lstm2)
norm2 = layers.BatchNormalization()(lstm3)
lstm4 = layers.LSTM(1000, return_sequences=True)(norm2)
lstm5 = layers.LSTM(1000)(lstm4)
dense1 = layers.Dense(1000, activation='relu')(lstm5)
dense2 = layers.Dense(1000, activation='sigmoid')(dense1)
outputs = layers.Dense(2)(dense2)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(features, labels_np, epochs=1, batch_size=4)
evaluate = model.evaluate(test_in, test_out, verbose=2)
While I have not solved the error, implementing the Sequential() model and using only two LSTM layers and a Dense layer changed the error: the training error is now very low while testing remains high. This now appears to be a (relatively) simple problem of overfitting rather than the more confusing error of high training loss. Hopefully, this helps anyone having a similar problem.
There are two things I notice and don't understand why you use them. The first is the dense2 layer with sigmoid activation. I don't think sigmoid activation is beneficial when we are trying to solve a regression problem. Can you change that to relu and see what happens? The second is that your output layer has two units. You did not specify it, but I think you are predicting two values from the same inputs. If you are trying to predict just one value, you should change that to
outputs = layers.Dense(1)(dense2)
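One further thing that may be worth checking in the question's code, offered as a hypothesis rather than a confirmed cause: astype and fit_transform both return new arrays instead of modifying their input, so calling them without assigning the result leaves features_np and labels_np unscaled. The sketch below uses made-up toy values in place of dataset.csv and output.csv, and separate scalers for features and labels (the variable names are assumptions).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy stand-ins for the contents of dataset.csv / output.csv
features_np = np.array([[100.0, 2000.0], [150.0, 2500.0], [200.0, 3000.0]])
labels_np = np.array([[20000.0, 21000.0], [20500.0, 21500.0], [21000.0, 22000.0]])

# result discarded: features_np keeps its original scale
MinMaxScaler(feature_range=(0, 1)).fit_transform(features_np)
unchanged_max = features_np.max()

# result assigned: separate scalers, so labels can be inverse-transformed later
feature_scaler = MinMaxScaler(feature_range=(0, 1))
label_scaler = MinMaxScaler(feature_range=(0, 1))
features_scaled = feature_scaler.fit_transform(features_np.astype(np.float64))
labels_scaled = label_scaler.fit_transform(labels_np)
```

If the real targets are on the order of tens of thousands, a mean squared error in the hundreds of millions would not be surprising for a model whose outputs start near zero, which is why the unassigned scaling is worth ruling out first.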

Training Resnet-50 with CIFAR-100 dataset in TensorFlow, can't get good accuracy

I am trying to train a ResNet-50 model in TensorFlow on the CIFAR-100 dataset. I used the built-in resnet_v1_50 to create the model, with two fully connected layers on its head, but my validation accuracy is stuck at nearly 37%. What is the problem? Did I define or configure resnet_v1_50 incorrectly? My model creation code is given below.
import tensorflow as tf
from tensorflow.contrib.slim.python.slim.nets import resnet_v1
X = tf.placeholder(dtype=tf.float32, shape=[None, 32, 32, 3])
Y = tf.placeholder(dtype=tf.float32, shape=[None, 100])
net, end_points = resnet_v1.resnet_v1_50(X,global_pool=False,is_training=True)
flattened = tf.contrib.layers.flatten(net)
dense_fc1 = tf.layers.dense(inputs=flattened,units=625, activation=tf.nn.relu,kernel_initializer=tf.contrib.layers.xavier_initializer())
dropout_fc1 = tf.layers.dropout(inputs=dense_fc1,rate=0.5, training=self.training)
logits = tf.layers.dense(inputs=dropout_fc1, units=num_classes,kernel_initializer = tf.contrib.layers.xavier_initializer())
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)
I think you have an extra dense layer. ResNet uses a single fully-connected layer with softmax and size=num_classes.
You might also need to make sure that your hyperparameters, like learning_rate and weight_decay, are set correctly, and that your input processing pipeline is also correct.
Here is an extra link to see if your pipeline is similar to a working solution.
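As a rough sketch of the single-layer head described above, the snippet below swaps in the Keras applications ResNet50 rather than the slim resnet_v1_50 from the question, so it is not a drop-in replacement for that code. It assumes CIFAR-100 inputs of shape (32, 32, 3), one-hot labels, and random initialization (weights=None); the optimizer settings are copied from the question.

```python
from tensorflow import keras

num_classes = 100  # CIFAR-100

# backbone without the ImageNet classifier; weights=None means random initialization
backbone = keras.applications.ResNet50(
    include_top=False, weights=None, input_shape=(32, 32, 3), pooling="avg")
# a single fully connected layer of size num_classes on top of the pooled features
outputs = keras.layers.Dense(num_classes, activation="softmax")(backbone.output)
model = keras.Model(backbone.input, outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

Global average pooling followed by one Dense layer keeps the head small compared with flattening the feature map and stacking a 625-unit layer on top.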

Keras CNN overfitting for more than four classes

I'm trying to train a classifier on Google QuickDraw drawings using Keras:
import numpy as np
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=5, data_format="channels_last", activation="relu", input_shape=(28, 28, 1)))
model.add(MaxPooling2D(data_format="channels_last"))
model.add(Conv2D(filters=16, kernel_size=3, data_format="channels_last", activation="relu"))
model.add(MaxPooling2D(data_format="channels_last"))
model.add(Flatten(data_format="channels_last"))
model.add(Dense(units=128, activation="relu"))
model.add(Dense(units=64, activation="relu"))
model.add(Dense(units=4, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
x = np.load("./x.npy")
y = np.load("./y.npy")
model.fit(x=x, y=y, batch_size=100, epochs=40, validation_split=0.2)
The input data is a 4d array with 12000 normalized images (28 x 28 x 1) per class. The output data is an array of one hot encoded vectors.
If I train this model on four classes, it produces convincing results:
(red is training data, blue is validation data)
I know the model is slightly overfitted. However, I want to keep the architecture as simple as possible, so I accepted that.
My problem is that as soon as I add just one arbitrary class, the model starts to overfit extremely:
I tried many different things to prevent it from overfitting such as Batch Normalization, Dropout, Kernel Regularizers, much more training data and different batch sizes, none of which caused any significant improvement.
What could be the reason why my CNN overfits so much?
EDIT: This is the code I used to create x.npy and y.npy:
import numpy as np
from tensorflow.keras.utils import to_categorical
files = ['cat.npy', 'dog.npy', 'apple.npy', 'banana.npy', 'flower.npy']
SAMPLES = 12000
x = np.concatenate([np.load(f'./data/{f}')[:SAMPLES] for f in files]) / 255.0
y = np.concatenate([np.full(SAMPLES, i) for i in range(len(files))])
# (samples, rows, cols, channels)
x = x.reshape(x.shape[0], 28, 28, 1).astype('float32')
y = to_categorical(y)
np.save('./x.npy', x)
np.save('./y.npy', y)
The .npy files come from here.
The problem lies with how the data split is done. Notice that there are 5 classes and you do 0.2 validation split. By default there's no shuffling and in your code you feed the data in a sequential order. What that means:
Training data consists entirely of 4 classes: 'cat.npy', 'dog.npy', 'apple.npy', 'banana.npy'. That's the 0.8 training split.
Test data is 'flower.npy'. That's your 0.2 validation split. The model was never trained on this so it gets terrible accuracy.
Such results are only possible thanks to the fact that the validation_split=0.2, so you get close to perfect class separation.
Solution
x = np.load("./x.npy")
y = np.load("./y.npy")
# Shuffle the data!
p = np.random.permutation(len(x))
x = x[p]
y = y[p]
model.fit(x=x, y=y, batch_size=100, epochs=40, validation_split=0.2)
If my hypothesis is correct, setting the validation_split to e.g. 0.5 should also get you much better results (though it's not a solution).
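The effect can be reproduced with the labels alone. The sketch below mimics how validation_split holds out the last fraction of the samples before any shuffling; the helper name last_fraction_split is made up, and the labels are built the same way as in the question's preprocessing script.

```python
import numpy as np

SAMPLES = 12000
n_classes = 5
# labels in the order the question concatenates them: all of class 0, then class 1, ...
y = np.concatenate([np.full(SAMPLES, i) for i in range(n_classes)])

def last_fraction_split(labels, validation_split):
    """Mimic validation_split: the last fraction of the samples is held out."""
    split_at = int(len(labels) * (1.0 - validation_split))
    return labels[:split_at], labels[split_at:]

# without shuffling: training sees classes 0-3 only, validation is all class 4
train_unshuffled, val_unshuffled = last_fraction_split(y, 0.2)

# with shuffling first: both parts contain every class
rng = np.random.default_rng(0)
train_shuffled, val_shuffled = last_fraction_split(y[rng.permutation(len(y))], 0.2)
```

With four classes and the same split, the held-out 20% still contains only the last class, but that class also appears in the training part, which is why the four-class run looked reasonable.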

Unable to track record by record processing in LSTM algorithm for text classification?

We are working on multi-class text classification and following is the process which we have used.
1) We have created 300 dim's vector with word2vec word embedding using our own data and then passed that vector as a weights to LSTM embedding layer.
2) And then we have used one LSTM layer and one dense layer.
Here below is my code:
input_layer = layers.Input((train_seq_x.shape[1], ))
embedding_layer = layers.Embedding(len(word_index)+1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)
lstm_layer1 = layers.LSTM(300,return_sequences=True,activation="relu")(embedding_layer)
lstm_layer1 = layers.Dropout(0.5)(lstm_layer1)
flat_layer = layers.Flatten()(lstm_layer1)
output_layer = layers.Dense(33, activation="sigmoid")(flat_layer)
model = models.Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy',metrics=['accuracy'])
Please help me out on the below questions:
Q1) Why did we pass word embedding vector(300 dim's) as weights in LSTM embedding layer?
Q2) How can we know optimal number of neural in LSTM layer?
Q3) Can you please explain how the single record processing in LSTM algorithm?
Please let me know if you require more information on the same.
Q1) Why did we pass word embedding vector(300 dim's) as weights in LSTM embedding layer?
In a very simplistic way, you can think of an embedding layer as a lookup table which converts a word (represented by its index in a dictionary) to a vector. It is a trainable layer. Since you have already trained word embeddings, instead of initializing the embedding layer with random weights you initialize it with the vectors you have learned.
Embedding(len(word_index)+1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
So here you are:
creating an embedding layer, i.e. a lookup table which can look up word indices 0 to len(word_index).
Each looked-up word maps to a vector of size 300.
This lookup table is loaded with the vectors from "embedding_matrix" (which is a pretrained model).
trainable=False freezes the weights in this layer.
You have passed 300 because it is the vector size of your pretrained model (embedding_matrix).
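The lookup-table view can be illustrated with plain NumPy indexing. The vocabulary, the 3-dimensional vectors (instead of 300) and the random matrix below are toy stand-ins, not the question's actual word2vec data.

```python
import numpy as np

# toy stand-ins: 4 known words plus index 0, and 3-dimensional vectors instead of 300
word_index = {"good": 1, "bad": 2, "movie": 3, "it": 4}
embedding_dim = 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(word_index) + 1, embedding_dim))

# a "sentence" as a sequence of word indices
sentence = np.array([word_index[w] for w in ["it", "good", "movie"]])

# the embedding layer's forward pass amounts to selecting rows of the matrix
vectors = embedding_matrix[sentence]
```

Each row of vectors is the pretrained vector of the corresponding word; with trainable=False those rows stay fixed during training.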
Q2) How can we know optimal number of neural in LSTM layer?
You have created an LSTM layer which takes a vector of size 300 as input and returns a vector of size 300. The output size and the number of stacked LSTMs are hyperparameters, which are tuned manually (usually using KFold CV).
Q3) Can you please explain how the single record processing in LSTM algorithm?
A single record/sentence is converted into indices of the vocabulary. So for every sentence you have an array of indices.
A batch of these sentences is created and fed as input to the model.
The LSTM is unrolled by passing in one index at a time as input at each timestep.
Finally, the output of the LSTM is forward propagated by a final dense layer to size 33. So it looks like each input is mapped to one of 33 classes in your case.
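The per-timestep processing can be sketched in NumPy for a single record. This is a simplified forward pass with the standard sigmoid/tanh gates and random toy weights, not Keras's implementation; the question's layer uses activation="relu" and 300 units, whereas the sizes and the function name lstm_forward here are assumptions chosen to keep the example small.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(inputs, W, U, b):
    """Run one sequence of shape (timesteps, input_dim) through a single LSTM layer."""
    units = U.shape[0]
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state
    outputs = []
    for x_t in inputs:  # one timestep (one word vector) at a time
        z = x_t @ W + h @ U + b
        i = sigmoid(z[:units])               # input gate
        f = sigmoid(z[units:2 * units])      # forget gate
        g = np.tanh(z[2 * units:3 * units])  # candidate cell state
        o = sigmoid(z[3 * units:])           # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)  # the per-timestep states, as with return_sequences=True

rng = np.random.default_rng(0)
timesteps, input_dim, units = 5, 8, 4  # toy sizes
sequence = rng.normal(size=(timesteps, input_dim))  # one embedded sentence
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

hidden_states = lstm_forward(sequence, W, U, b)
flat = hidden_states.reshape(-1)  # what Flatten would pass on to the Dense layer
```

The hidden and cell states carry information from earlier words to later ones, which is what distinguishes this from applying a Dense layer to each word independently.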
Simple example
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.utils import to_categorical
training_data = ["it was a good movie".split(), "it was a bad movie".split()]
training_target = [1, 0]
# map each word to an integer index; 0 is left unused (commonly reserved for padding)
words = sorted({word for s in training_data for word in s})
word_index = {word: i + 1 for i, word in enumerate(words)}
model = Sequential()
model.add(Embedding(len(word_index) + 1, 50))
# dropout / recurrent_dropout replace the older dropout_W / dropout_U arguments
model.add(LSTM(10, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
x = np.array([[word_index[word] for word in s] for s in training_data])
y = to_categorical(training_target)
model.fit(x, y)
model.summary()