LSTM training error is very high and relatively unchanging - numpy

As a learning exercise, I'm trying to use an LSTM model with the Keras framework to predict the stock market based on multiple data points. The size of my input array is roughly [5000, 100]. Based on other questions on this site and articles online, the approach seems fairly standard: put the data in a numpy array, scale it, reshape it to 3 dimensions for the LSTM, split it into train and test sections, and feed it through the model.

Running only the training portion of the model, I am consistently getting loss scores around 400,000,000. This is not changed by altering the batch size, the number of epochs, the number of layers, replacing the normalization with dropout layers, changing the sizes of each layer, or using different optimizers and loss functions.

Any idea why the loss is so high and what I can do to fix that? Attached is the code. All advice is greatly appreciated.
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, losses, optimizers, Model, preprocessing
from keras.utils import plot_model
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
scaler = MinMaxScaler(feature_range=(0, 1))
features_df = pd.read_csv("dataset.csv")
features_np = np.array(features_df)
features_np.astype(np.float64)
scaler.fit_transform(features_np)
num_features=features_np.shape[1]
features = np.reshape(features_np, (features_np.shape[0], 1, features_np.shape[1]))
labels_np = np.array(pd.read_csv("output.csv"))
scaler.fit_transform(labels_np)
test_in = features_np[int(features_np.shape[0] * 0.75):]
test_in = np.reshape(test_in, (test_in.shape[0], 1, test_in.shape[1]))
test_out = labels_np[int(labels_np.shape[0] * 0.75):]
test_out = np.reshape(test_out, (test_out.shape[0], 1, test_out.shape[1]))
inputs = layers.Input(shape=(1, features.shape[2]))
x = layers.LSTM(5000, return_sequences=True)(inputs)
lstm1 = layers.LSTM(1000, return_sequences=True)(x)
norm1 = layers.BatchNormalization()(lstm1)
lstm2 = layers.LSTM(1000, return_sequences=True)(norm1)
lstm3 = layers.LSTM(1000, return_sequences=True)(lstm2)
norm2 = layers.BatchNormalization()(lstm3)
lstm4 = layers.LSTM(1000, return_sequences=True)(norm2)
lstm5 = layers.LSTM(1000)(lstm4)
dense1 = layers.Dense(1000, activation='relu')(lstm5)
dense2 = layers.Dense(1000, activation='sigmoid')(dense1)
outputs = layers.Dense(2)(dense2)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(features, labels_np, epochs=1, batch_size=4)
evaluate = model.evaluate(test_in, test_out, verbose=2)

While I have not fully solved the problem, implementing the Sequential() model and using only two LSTM layers and a Dense layer changed the behavior: the training error is now very low while the testing error remains high. This now appears to be a (relatively) simple problem of overfitting rather than the more confusing one of high training loss. Hopefully this helps anyone having a similar problem.
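For reference, a minimal sketch of the smaller Sequential model described above, reusing the variables from the code in the question; the layer sizes are my own assumptions, since the update does not give them:
small_model = tf.keras.Sequential([
    layers.LSTM(64, return_sequences=True, input_shape=(1, num_features)),  # size assumed
    layers.LSTM(32),                                                        # size assumed
    layers.Dense(labels_np.shape[1]),
])
small_model.compile(optimizer='adam', loss='mean_squared_error')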

There are two things I notice and don't understand why you use them. The first is the dense2 layer with sigmoid activation. I don't think a sigmoid activation is beneficial when we are trying to solve a regression problem; can you change that to relu and see what happens? The second is that your final Dense layer has two units. You did not specify this, but I think it means you are predicting two values from the same inputs. If you are trying to predict just one value, you should change that to
outputs = layers.Dense(1)(dense2)
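A sketch of the revised head with both suggestions applied (relu in place of the sigmoid, and a single output unit, assuming output.csv really has only one target column):
dense1 = layers.Dense(1000, activation='relu')(lstm5)
dense2 = layers.Dense(1000, activation='relu')(dense1)  # relu instead of sigmoid
outputs = layers.Dense(1)(dense2)                       # one regression target instead of two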

Related

XOR problem with 2-2-1 configuration should always predict output accurately?

I am trying to solve the XOR problem using the following code:
import numpy as np
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input, Concatenate
from tensorflow.keras.utils import plot_model
from tensorflow.keras.optimizers import SGD, Adam
# input data
x = np.array([[0,0], [0,1], [1,0], [1,1]], 'float32')
y = np.array([[0], [1], [1], [0]], 'float32')
### Model
model = Sequential()
# add layers (architecture)
model.add(Dense(2, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
# compile
model.compile(loss = 'mean_squared_error',
              optimizer = SGD(learning_rate = 0.1, momentum = 0.8),
              metrics = ['accuracy'])
# train
model.fit(x, y, epochs = 25000, batch_size = 1)
# evaluate
ev = model.evaluate(x, y)
I already tested:
using different activation functions in the hidden layer (sigmoid and tanh)
using different learning rates and momentum
Also, I am running with a high number of epochs (25000). Still, it only accurately predicts all outputs a few times. Most of the time, accuracy is equal to 0.5 or 0.75.
I have read that this is the minimum configuration to solve this problem. However, it also seems that the error surface presents a number of regions with local minima.
My question is:
Should I assume that the model is correct and can learn the problem, although it sometimes gets 'stuck' in a local minimum, OR do I still need to improve my model somehow to solve XOR more accurately and consistently?
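One way to make the "sometimes stuck" observation concrete is to repeat training over several fresh random initializations and count how often the 2-2-1 network reaches 100% accuracy. A minimal sketch, reusing x and y from the code above; the helper name, the reduced epoch count, and the number of trials are my own choices, not from the question:
def run_trial(epochs=2000):
    # Rebuild the same 2-2-1 model so each trial starts from a new random initialization
    m = Sequential()
    m.add(Dense(2, activation='relu'))
    m.add(Dense(1, activation='sigmoid'))
    m.compile(loss='mean_squared_error',
              optimizer=SGD(learning_rate=0.1, momentum=0.8),
              metrics=['accuracy'])
    m.fit(x, y, epochs=epochs, batch_size=1, verbose=0)
    return m.evaluate(x, y, verbose=0)[1]  # accuracy on the 4 XOR patterns

accuracies = [run_trial() for _ in range(20)]
print(sum(a == 1.0 for a in accuracies), "of 20 runs reached 100% accuracy")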

Why can't I classify my data perfectly on this simple problem using a NN?

I have a set of observations made of 10 features, each of these features being a real number in the interval (0,2). Say I wanted to train a simple neural network to classify whether the average of those features is above or below 1.0.
Unless I'm missing something, it should be enough to use a two-layer network with one neuron on each layer. The activation functions would be a linear one (i.e. no activation function) on the first layer and a sigmoid on the output layer. An example of a NN with this architecture that would work is one that calculates the average on the first layer (i.e. all weights = 0.1 and bias = 0) and assesses whether that is above or below 1.0 in the second layer (i.e. weight = 1.0 and bias = -1.0).
When I implement this using TensorFlow (see code below), I obviously get a very high accuracy quite quickly, but never get to 100% accuracy... I would like some help to understand conceptually why this is the case. I don't see why the backpropagation algorithm does not reach a set of optimal weights (maybe this is related to the loss function I'm using, which has local minima?). Also, I would like to know whether 100% accuracy is achievable if I use different activations and/or a different loss function.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
X = [np.random.random(10)*2.0 for _ in range(10000)]
X = np.array(X)
y = X.mean(axis=1) >= 1.0
y = y.astype('int')
train_ratio = 0.8
train_len = int(X.shape[0]*0.8)
X_train, X_test = X[:train_len,:], X[train_len:,:]
y_train, y_test = y[:train_len], y[train_len:]
def create_classifier(lr = 0.001):
    classifier = tf.keras.Sequential()
    classifier.add(tf.keras.layers.Dense(units=1))
    classifier.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))#, input_shape=input_shape))
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    metrics = [tf.keras.metrics.BinaryAccuracy()]
    classifier.compile(optimizer=optimizer, loss=tf.keras.losses.BinaryCrossentropy(from_logits=False), metrics=metrics)
    return classifier
classifier = create_classifier(lr = 0.1)
history = classifier.fit(X_train, y_train, batch_size=1000, validation_split=0.1, epochs=2000)
Ignoring the fact that a neural network is an odd approach for this problem, and answering your specific question: it looks like your learning rate might be too high, which could explain the fluctuations around the optimal point.
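A quick way to test that suggestion is to reuse the question's create_classifier helper with a smaller learning rate (0.001 here is an assumed value, not one from the question):
classifier = create_classifier(lr=0.001)  # much lower than the 0.1 used above
history = classifier.fit(X_train, y_train, batch_size=1000, validation_split=0.1, epochs=2000)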

Learning a Categorical Variable with TensorFlow Probability

I would like to use TFP to write a neural network where the outputs are the probabilities of a categorical variable with 3 classes, and train it using the negative log-likelihood.
As I'm taking my first steps with TF and TFP, I started with a toy model where the input layer has only 1 unit receiving a null input, and the output layer has 3 units with a softmax activation function. The idea is that the biases should learn (up to an additive constant) the log of the probabilities.
Here below is my code, true_p are the true parameters I use to generate the data and I would like to learn, while learned_p is what I get from the NN.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from functions import nll
from tensorflow.keras.optimizers import SGD
import tensorflow.keras.layers as layers
import tensorflow_probability as tfp
tfd = tfp.distributions
# params
true_p = np.array([0.1, 0.7, 0.2])
n_train = 1000
# training data
x_train = np.array(np.zeros(n_train)).reshape((n_train,))
y_train = np.array(np.random.choice(len(true_p), size=n_train, p=true_p)).reshape((n_train,))
# model
input_layer = layers.Input(shape=(1,))
p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)
model_p.compile(SGD(), loss=nll)
# training
hist_p = model_p.fit(x=x_train, y=y_train, batch_size=100, epochs=3000, verbose=0)
# check result
learned_p = np.round(model_p.layers[1].call(tf.constant([0], shape=(1, 1))).numpy(), 3)
learned_p
With this setup, I get the result:
>>> learned_p
array([[0.005, 0.989, 0.006]], dtype=float32)
I over-estimate the second category and can't really distinguish between the first and the third one. What's worse, if I plot the probabilities at the end of each epoch, it looks like they are converging monotonically to the vector [0,1,0], which doesn't make sense (it seems to me the gradient should push in the opposite direction once I start to over-estimate).
I really can't figure out what's going on here, but have the feeling I'm doing something plain wrong. Any idea? Thank you for your help!
For the record, I also tried using other optimizers like Adam or Adagrad playing with the hyper-params, but with no luck.
I'm using Python 3.7.9, TensorFlow 2.3.1 and TensorFlow probability 0.11.1
I believe the default argument to Categorical is not the vector of probabilities, but the vector of logits (values you'd take softmax of to get probabilities). This is to help maintain precision in internal Categorical computations like log_prob. I think you can simply eliminate the softmax activation function and it should work. Please update if it doesn't!
EDIT: alternatively you can replace the tfd.Categorical with
lambda p: tfd.Categorical(probs=p)
but you'll lose the aforementioned precision gains. Just wanted to clarify that passing probs is an option, just not the default.
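A minimal sketch of the first suggestion, with the rest of the question's model unchanged: dropping the softmax means the Dense layer emits raw logits, which is what tfd.Categorical takes by default.
p_layer = layers.Dense(len(true_p))(input_layer)               # no activation: outputs are logits
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)  # interpreted as Categorical(logits=...)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)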

Unable to track record by record processing in LSTM algorithm for text classification?

We are working on multi-class text classification; the following is the process we have used.
1) We created 300-dimensional vectors with word2vec word embeddings using our own data and then passed those vectors as weights to the embedding layer of the LSTM model.
2) We then used one LSTM layer and one dense layer.
Here below is my code:
input_layer = layers.Input((train_seq_x.shape[1], ))
embedding_layer = layers.Embedding(len(word_index)+1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)
lstm_layer1 = layers.LSTM(300,return_sequences=True,activation="relu")(embedding_layer)
lstm_layer1 = layers.Dropout(0.5)(lstm_layer1)
flat_layer = layers.Flatten()(lstm_layer1)
output_layer = layers.Dense(33, activation="sigmoid")(flat_layer)
model = models.Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy',metrics=['accuracy'])
Please help me out with the questions below:
Q1) Why did we pass the word embedding vectors (300 dimensions) as weights to the embedding layer?
Q2) How can we know the optimal number of neurons in the LSTM layer?
Q3) Can you please explain how a single record is processed in the LSTM algorithm?
Please let me know if you require more information.
Q1) Why did we pass the word embedding vectors (300 dimensions) as weights to the embedding layer?
In a very simplistic way, you can think of an embedding layer as a lookup table which converts a word (represented by its index in a dictionary) to a vector. It is a trainable layer. Since you have already trained word embeddings, instead of initializing the embedding layer with random weights you initialize it with the vectors you have learned.
Embedding(len(word_index)+1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
So here you are:
- creating an embedding layer, i.e. a lookup table which can look up word indices from 0 to len(word_index);
- each looked-up word maps to a vector of size 300;
- this lookup table is loaded with the vectors from "embedding_matrix" (which is a pretrained model);
- trainable=False freezes the weights in this layer.
You have passed 300 because it is the vector size of your pretrained model (embedding_matrix).
Q2) How can we know the optimal number of neurons in the LSTM layer?
You have created an LSTM layer which takes a 300-dimensional vector as input and returns a 300-dimensional vector. The output size and the number of stacked LSTMs are hyperparameters that are tuned manually (usually with KFold cross-validation), for example as sketched below.
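A rough sketch of such a manual search with scikit-learn's KFold; the candidate sizes and the build_model(units) helper are hypothetical placeholders, not from the question:
from sklearn.model_selection import KFold
import numpy as np

def cross_val_score_for(units, X, y, build_model, n_splits=5):
    # Average validation accuracy over the folds for one candidate LSTM size;
    # assumes build_model(units) returns a compiled model with an accuracy metric.
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True).split(X):
        m = build_model(units)
        m.fit(X[train_idx], y[train_idx], epochs=5, verbose=0)
        scores.append(m.evaluate(X[val_idx], y[val_idx], verbose=0)[1])
    return np.mean(scores)

# best_units = max([50, 100, 300], key=lambda u: cross_val_score_for(u, X, y, build_model))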
Q3) Can you please explain how a single record is processed in the LSTM algorithm?
A single record/sentence is converted into indices of the vocabulary, so for every sentence you have an array of indices.
A batch of these sentences is created and fed as input to the model.
The LSTM is unrolled by passing in one index at a time as input at each timestep.
Finally, the output of the LSTM is forward propagated through a final dense layer of size 33, so it looks like each input is mapped to one of 33 classes in your case.
Simple example
import numpy as np
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten, LSTM
from keras.layers.embeddings import Embedding
from nltk.lm import Vocabulary
from keras.utils import to_categorical
training_data = [ "it was a good movie".split(), "it was a bad movie".split()]
training_target = [1, 0]
v = Vocabulary([word for s in training_data for word in s])
model = Sequential()
model.add(Embedding(len(v), 50, input_length=5))         # the Keras 1 'dropout' argument no longer exists in Keras 2
model.add(LSTM(10, dropout=0.2, recurrent_dropout=0.2))  # Keras 2 names for the old dropout_W / dropout_U
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())
x = np.array([list(map(lambda x: v[x], s)) for s in training_data])
y = to_categorical(training_target)
model.fit(x,y)

keras batchnorm layer moving average for variance

I've been trying to understand the Keras BatchNorm layer behavior in my Keras NN model. One question I encountered was how the BN layer is calculating the moving average of the 'variance'. My understanding is Keras is using exponential-weighted-average method to calculate the moving average for both mean and variance from the training mini-batches. But regardless of this, after a really large number of epochs, this moving average should approach the mean/variance of the training data set. But in my simple example, the 'variance' moving average is always different from the training data 'variance'. Below is my code and output:
from keras.layers import Input, BatchNormalization
from keras.models import Model
from keras.optimizers import Adam, RMSprop
import numpy as np
X_input = Input(shape=(6,))
X = BatchNormalization(axis=-1)(X_input)
model = Model(inputs=X_input, outputs=X)
model.compile(optimizer=RMSprop(), loss='mean_squared_error')
np.random.seed(3)
train_data = np.random.random((5,6))
train_label = np.random.random((5,6))
model.fit(x=train_data, y=train_label, epochs=10000, batch_size=6, verbose=False)
bn_gamma, bn_beta, bn_mean, bn_var = model.layers[1].get_weights()
train_mean = np.mean(train_data, axis=0)
train_var = np.var(train_data, axis=0)
print("train_mean: {}".format(train_mean))
print("moving_mean: {}".format(bn_mean))
print("train_var: {}".format(train_var))
print("moving_var: {}".format(bn_var))
Below is the output:
train_mean: [0.42588575 0.47785879 0.32170309 0.49151921 0.355046 0.60104636]
moving_mean: [0.4258843 0.47785735 0.32170165 0.49151778 0.35504454 0.60104346]
train_var: [0.03949981 0.05228663 0.04027516 0.02522536 0.10261097 0.0838988 ]
moving_var: [0.04938692 0.06537427 0.05035637 0.03153942 0.12829503 0.10489936]
If you see, the train_mean is the same as the moving average mean of BN layer, but train_var (variance) is not. Can anyone please help here? Thanks.
If you look at the source code of batchnorm, you can see that the unbiased estimator of the population variance is used; here is the relevant line:
variance *= sample_size / (sample_size - (1.0 + self.epsilon))
In your case, the sample size is 5, so you should have train_var * 5./4 == moving_var, which is the case.
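To check this numerically with the variables from the question's code (1e-3 is the default epsilon of Keras BatchNormalization):
sample_size = train_data.shape[0]  # 5
epsilon = 1e-3                     # Keras BatchNormalization default
corrected_var = train_var * sample_size / (sample_size - (1.0 + epsilon))
print(np.allclose(corrected_var, bn_var, rtol=1e-3))  # expected: True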