How to adjust a model for a rare binary outcome with TensorFlow or GBM

I'm currently working on data with a rare binary outcome, i.e. the response vector contains mostly 0s and only a few 1s (approximately 1.5% ones). I have about 20 continuous explanatory variables. I've tried training models using GBM, random forests, and Keras with a TensorFlow backend.
Regardless of which method I used, I observed the same behavior of the models:
The accuracy is high (~98%), but the model predicts a probability of ~98.5% for class "0" and ~1.5% for class "1" for every observation.
How can I prevent this behavior?
I'm using RStudio. For example, a TF model with Keras would be:
model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(20)) %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 2, activation = "sigmoid")

parallel_model <- multi_gpu_model(model, gpus = 2)
parallel_model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "binary_accuracy")

history <- parallel_model %>% fit(
  x_train, y_train,
  batch_size = 64,
  epochs = 100,
  class_weight = list("0" = 1, "1" = 70),
  verbose = 1,
  validation_split = 0.2
)
But my observation is not limited to TF, which makes my question more general: I'm not asking for specific adjustments to the model above, but rather want to discuss at what point all outcomes end up being assigned the same probability.
My guess is that the issue is connected to the loss function.
I know AUC cannot be used as a loss function, since it's not differentiable. When I evaluate the models with AUC on unseen data, the results are no better than random guessing.
I don't mind answers with code in Python, since this is less a coding problem than a question about general behavior and algorithms.

When your problem has imbalanced classes, I suggest using SMOTE before training the model (on the training data only; never apply SMOTE to your test data, so that evaluation reflects the real class distribution).
For example:
from imblearn.over_sampling import SMOTE

# note: newer imblearn versions use fit_resample / sampling_strategy
# instead of the older fit_sample / ratio arguments
X_trn_balanced, Y_trn_balanced = SMOTE(random_state=1, sampling_strategy=1.0).fit_resample(X_trn, Y_trn)

# next, fit the model with the balanced data
model.fit(X_trn_balanced, Y_trn_balanced)
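To make the "training data only" caveat concrete, here is a hedged sketch of the split-first workflow (X and y are hypothetical names for the full feature matrix and label vector):

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# split first, so the test set keeps the real (imbalanced) class distribution
X_trn, X_tst, Y_trn, Y_tst = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=1)

# oversample the minority class in the training portion only
X_trn_balanced, Y_trn_balanced = SMOTE(random_state=1).fit_resample(X_trn, Y_trn)

# train on the balanced data, evaluate on the untouched test set
model.fit(X_trn_balanced, Y_trn_balanced)
predictions = model.predict(X_tst)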

In my (not so big) experience with AUC problems and rare positives, I use models with a single output (not two): the output is either "result is positive (1)" or "result is negative (0)".
Metrics like accuracy are useless for these problems; you should use AUC-based metrics with big batch sizes.
For these problems, it doesn't matter whether the output probabilities are small, as long as there is a difference between them. (Forests, GBM, etc. will indeed output these small values, but this is not a problem.)
For neural networks, you can try using class weights to increase the output probabilities. But notice that if you split the result into two separate classes (where only one class should be positive), it doesn't matter whether you use weights, because:
For the first class, with low weights: predicting all ones is good.
For the second class, with high weights: predicting all zeros is good (and the weighting makes it look even better).
So, as an initial solution, you can:
Use a 'softmax' activation (to guarantee your model will have only one correct output) and a 'categorical_crossentropy' loss.
(Or, preferably) Use a model with only one output unit and keep 'sigmoid' with 'binary_crossentropy'.
I always work with the preferred option above. In this case, if you use batch sizes that are big enough to contain one or two positive examples (a batch size of around 100 for you), weights may even be discarded. If the batch sizes are too small and many batches contain no positive examples, you may get too many weight updates towards plain zeros, which is bad.
You may also resample your data and, for instance, multiply the number of positive examples by 10, so your batches contain more positives and training becomes easier.
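As a minimal Keras sketch of the single-output option (the layer sizes, the 1:70 class weight, and the x_train / y_train names are assumptions carried over from the question, with y_train as a plain 0/1 vector rather than one-hot):

from tensorflow import keras
from tensorflow.keras import layers

# one output unit with sigmoid + binary_crossentropy
model = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(20,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# batch size large enough that most batches contain a few positives (~1.5% positive rate)
model.fit(x_train, y_train,
          batch_size=256,
          epochs=100,
          class_weight={0: 1.0, 1: 70.0},   # optional; may be unnecessary with big batches
          validation_split=0.2)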
Example of an AUC metric to determine when training should end:
# in python - considering outputs with only one class
import tensorflow as tf
from tensorflow.keras import backend as K

def aucMetric(true, pred):
    true = K.flatten(true)
    pred = K.flatten(pred)

    # sort the labels by descending predicted score
    totalCount = K.shape(true)[0]
    values, indices = tf.nn.top_k(pred, k=totalCount)
    sortedTrue = K.gather(true, indices)

    # running count of true positives as the threshold moves down the ranking
    tpCurve = K.cumsum(sortedTrue)
    negatives = 1 - sortedTrue

    # summing the TP count at each negative counts the correctly ordered (positive, negative) pairs
    auc = K.sum(tpCurve * negatives)

    # normalize by the total number of (positive, negative) pairs
    totalCount = K.cast(totalCount, K.floatx())
    positiveCount = K.sum(true)
    negativeCount = totalCount - positiveCount
    totalArea = positiveCount * negativeCount
    return auc / totalArea
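As a usage sketch (assuming the single-output Keras model and the x_train / y_train arrays from above), this metric can be monitored to decide when to stop training; it shows up in the logs under the function name, here aucMetric:

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[aucMetric])

# stop once the validation AUC stops improving
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_aucMetric", mode="max",
                                              patience=10, restore_best_weights=True)

model.fit(x_train, y_train, batch_size=256, epochs=100,
          validation_split=0.2, callbacks=[early_stop])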

Related

Why can't I classify my data perfectly on this simple problem using a NN?

I have a set of observations made of 10 features, each of these features being a real number in the interval (0,2). Say I wanted to train a simple neural network to classify whether the average of those features is above or below 1.0.
Unless I'm missing something, a two-layer network with one neuron in each layer should be enough. The activation functions would be a linear one (i.e. no activation function) on the first layer and a sigmoid on the output layer. An example of a NN with this architecture that would work is one that calculates the average in the first layer (i.e. all weights = 0.1 and bias = 0) and assesses whether that is above or below 1.0 in the second layer (i.e. weight = 1.0 and bias = -1.0).
When I implement this using TensorFlow (see code below), I obviously get a very high accuracy quite quickly, but never reach 100% accuracy. I would like some help understanding conceptually why this is the case. I don't see why the backpropagation algorithm does not reach a set of optimal weights (maybe this is related to the loss function I'm using, which has local minima?). I would also like to know whether 100% accuracy is achievable if I use different activations and/or a different loss function.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

X = [np.random.random(10)*2.0 for _ in range(10000)]
X = np.array(X)
y = X.mean(axis=1) >= 1.0
y = y.astype('int')

train_ratio = 0.8
train_len = int(X.shape[0]*train_ratio)
X_train, X_test = X[:train_len,:], X[train_len:,:]
y_train, y_test = y[:train_len], y[train_len:]

def create_classifier(lr = 0.001):
    classifier = tf.keras.Sequential()
    classifier.add(tf.keras.layers.Dense(units=1))
    classifier.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))#, input_shape=input_shape))
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    metrics = [tf.keras.metrics.BinaryAccuracy()]
    classifier.compile(optimizer=optimizer, loss=tf.keras.losses.BinaryCrossentropy(from_logits=False), metrics=metrics)
    return classifier

classifier = create_classifier(lr = 0.1)
history = classifier.fit(X_train, y_train, batch_size=1000, validation_split=0.1, epochs=2000)
Ignoring the fact that a neural network is an odd approach for this problem, and answering your specific question: it looks like your learning rate might be too high, which could explain the fluctuations around the optimal point.
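As a quick, hedged sanity check (reusing the create_classifier helper from the question), lowering the learning rate back toward the Adam default and re-running the fit is one way to see whether the fluctuations go away:

# same model, but with a much smaller learning rate than 0.1
classifier = create_classifier(lr=0.001)
history = classifier.fit(X_train, y_train, batch_size=1000,
                         validation_split=0.1, epochs=2000, verbose=0)

# best validation accuracy seen during training
print(max(history.history['val_binary_accuracy']))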

How to design an optimal CNN? [closed]

I am working on a Ph.D. project whose objective is to reduce CO2 emissions on Earth.
I have a dataset, and I was able to successfully implement a CNN, which gives 80% accuracy (worst-case scenario). However, the field where I work is very demanding, and I have the impression that I could get better accuracy with a well-optimized CNN.
How do experts design CNN's? How could I choose between Inception Modules, Dropout Regularization, Batch Normalization, convolutional filter size, size and depth of convolutional channels, number of fully-connected layers, activations neurons, etc? How do people navigate this large optimization problem in a scientific manner? The combinations are endless. Are there any real-life examples where this problem is navigated, addressing its full complexity (not just optimizing a few hyper-parameters)?
Hopefully, my dataset is not too large, so the CNN models that I am considering should have very few parameters.
How do experts design CNN's? How could I choose between Inception Modules, Dropout Regularization, Batch Normalization, convolutional filter size, size and depth of convolutional channels, number of fully-connected layers, activations neurons, etc? How do people navigate this large optimization problem in a scientific manner? The combinations are endless.
You are right that the number of combinations is huge, and without the right approach you may end up nowhere. As the saying goes, machine learning is an art, not a science, and results are data-dependent. Here are a few tips regarding your concern.
Log Everything: During training, save the necessary logs of every experiment, such as training loss, validation loss, weight files, execution times, visualizations, etc. Some of these can be saved with CSVLogger, ModelCheckpoint, etc., and TensorBoard is a great tool for inspecting training logs, visualizations, and much more. A minimal callback setup is sketched below.
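For instance, a minimal Keras logging setup might look like this sketch (the file names, log directory, and the model / train_ds / val_ds objects are placeholders):

import tensorflow as tf

callbacks = [
    # append per-epoch metrics to a CSV file
    tf.keras.callbacks.CSVLogger("training_log.csv", append=True),
    # keep the weights of the best epoch according to validation loss
    tf.keras.callbacks.ModelCheckpoint("best_weights.h5", monitor="val_loss",
                                       save_best_only=True),
    # write logs that TensorBoard can visualize
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
]

model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)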
Strong Validation Strategies: This is very important. To build a stable cross-validation (CV) scheme, we must have a good understanding of the data and the challenges it poses. We check that the validation set has a distribution similar to the training set and the test set, and we try to make sure our models improve both on our CV and on the test set (if ground truth is available for the test set). Partitioning the data randomly is usually not enough to guarantee this. Understanding the data, and how to partition it without introducing data leakage into the CV, is key to avoiding overfitting.
Change Only One: During experimentation, change one thing at a time and save the observations (logs) for those changes. For example, increase the image size gradually (say, starting from 224) and observe the results. We should start with a small combination: while experimenting with the image size, keep everything else fixed, such as the model architecture and the learning rate. The same goes for the learning rate or the model architecture. Later, we may need to change more than one thing once we find some promising combinations. In Kaggle competitions, these are very common approaches to follow. A very simple example of this is given below, though it is by no means the only way.
However, as you said, your Ph.D. project is about reducing CO2 emissions on Earth. In my understanding, this is more an application-specific problem than an algorithm-specific one, so I think it's better to take advantage of well-recognized pre-trained models.
In case we wish to write our own CNN, we should give it a decent amount of time. Start with a very simple one, for example:
Conv2D(16, 3, 'relu') -> MaxPool(2)
Conv2D(32, 3, 'relu') -> MaxPool(2)
Conv2D(64, 3, 'relu') -> MaxPool(2)
Conv2D(128, 3, 'relu') -> MaxPool(2)
Here we gradually increase the depth while reducing the feature dimension; by the final layers, more semantic information emerges. When stacking Conv2D layers, it's common practice to increase the channel depth in the order 16, 32, 64, 128, etc. If we want to add an Inception or residual block inside our network, I think we should first do some basic math about what feature shapes will come out of it. Following that line of thinking, we may also want to look at approaches like SENet, ResNeSt, etc. Regarding Dropout, if we observe that our model is overfitting during training, we should add some. In the final layer, we may want to choose GlobalAveragePooling over a Flatten layer followed by fully connected layers. We can probably now see that a lot of ablation studies are needed to get a satisfactory CNN model.
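A hedged sketch of that simple starting stack in Keras (the input shape, number of classes, and the GlobalAveragePooling head are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

def simple_cnn(input_shape=(224, 224, 3), num_classes=10):
    # depth grows (16 -> 32 -> 64 -> 128) while the spatial size shrinks
    model = tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation='relu', padding='same'),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation='relu', padding='same'),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation='relu', padding='same'),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation='relu', padding='same'),
        layers.MaxPooling2D(2),
        layers.GlobalAveragePooling2D(),
        layers.Dense(num_classes, activation='softmax'),
    ])
    return model

model = simple_cnn()
model.summary()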
In this regard, we suggest you explore two important things: (1) read one of the pre-trained model papers/blogs/videos about the strategies used to build the architecture, for example EfficientNet Explained; (2) then explore its source code. That will give you more intuition and encourage you to build your own.
We would like to end with one last working example. See the model diagram below; it's a small Inception network (source). If we look closely, we will see that it consists of the following three modules:
Conv Module
Inception Module
Downsample Module
Take a close look at each module's configuration, such as filter sizes and strides. Let's try to understand and implement these modules. Before that, here are two good references (1, 2) on the Inception concept to refresh it.
Conv Module
From the diagram we can see that it consists of one convolutional layer, one batch normalization layer, and one ReLU activation. It produces C feature maps with K x K filters and S x S strides. To implement it, we will create a class that inherits from tf.keras.layers.Layer:
import tensorflow as tf

class ConvModule(tf.keras.layers.Layer):
    def __init__(self, kernel_num, kernel_size, strides, padding='same'):
        super(ConvModule, self).__init__()
        # conv layer
        self.conv = tf.keras.layers.Conv2D(kernel_num,
                                           kernel_size=kernel_size,
                                           strides=strides, padding=padding)
        # batch norm layer
        self.bn = tf.keras.layers.BatchNormalization()

    def call(self, input_tensor, training=False):
        x = self.conv(input_tensor)
        x = self.bn(x, training=training)
        x = tf.nn.relu(x)
        return x
Inception Module
Next comes the Inception module. According to the diagram above, it consists of two convolutional modules whose outputs are then concatenated. To concatenate them, we need to make sure the output feature maps have the same spatial dimensions (height and width).
class InceptionModule(tf.keras.layers.Layer):
    def __init__(self, kernel_size1x1, kernel_size3x3):
        super(InceptionModule, self).__init__()
        # two conv modules: they will take same input tensor
        self.conv1 = ConvModule(kernel_size1x1, kernel_size=(1,1), strides=(1,1))
        self.conv2 = ConvModule(kernel_size3x3, kernel_size=(3,3), strides=(1,1))
        self.cat = tf.keras.layers.Concatenate()

    def call(self, input_tensor, training=False):
        x_1x1 = self.conv1(input_tensor)
        x_3x3 = self.conv2(input_tensor)
        x = self.cat([x_1x1, x_3x3])
        return x
Here you may notice that we have hard-coded the exact kernel sizes and strides for both convolutional layers according to the network diagram. Also, since in ConvModule we already set padding to 'same', the spatial dimensions of the feature maps will be the same for both self.conv1 and self.conv2, which is required in order to concatenate them at the end.
Again, in this module the two arguments kernel_size1x1 and kernel_size3x3 act as placeholders, and this is on purpose: we will need different numbers of feature maps at different stages of the model. Looking at the model diagram, we can see that InceptionModule takes a different number of filters at each stage.
Downsample Module
Lastly, the downsampling module. The main intuition behind downsampling is that we want the feature maps to keep the information that best represents the input while discarding less relevant details, so that the model can focus on what matters most. There are many ways to reduce the dimension of the feature maps (or inputs), for example using strides of 2 or a conventional pooling operation. There are several types of pooling, namely MaxPooling, AveragePooling, and GlobalAveragePooling.
From the diagram, we can see that the downsampling module contains one convolutional layer and one max-pooling layer whose outputs are later concatenated. Looking closely at the diagram (top right), the convolutional layer uses a 3 x 3 filter with 2 x 2 strides, and the pooling layer (here MaxPooling) uses a 3 x 3 pool size with 2 x 2 strides. We also have to ensure that the outputs of both have the same dimensions in order to merge them at the end. Remember that when we designed ConvModule we purposely set the default padding to 'same'; in this case, we need to set it to 'valid'.
class DownsampleModule(tf.keras.layers.Layer):
    def __init__(self, kernel_size):
        super(DownsampleModule, self).__init__()
        # conv layer
        self.conv3 = ConvModule(kernel_size, kernel_size=(3,3),
                                strides=(2,2), padding="valid")
        # pooling layer
        self.pool = tf.keras.layers.MaxPooling2D(pool_size=(3, 3),
                                                 strides=(2,2))
        self.cat = tf.keras.layers.Concatenate()

    def call(self, input_tensor, training=False):
        # forward pass
        conv_x = self.conv3(input_tensor, training=training)
        pool_x = self.pool(input_tensor)
        # merged
        return self.cat([conv_x, pool_x])
Okay, now we have built all three modules: ConvModule, InceptionModule, and DownsampleModule. Let's initialize their parameters according to the diagram.
class MiniInception(tf.keras.Model):
    def __init__(self, num_classes=10):
        super(MiniInception, self).__init__()

        # the first conv module
        self.conv_block = ConvModule(96, (3,3), (1,1))

        # 2 inception modules and 1 downsample module
        self.inception_block1 = InceptionModule(32, 32)
        self.inception_block2 = InceptionModule(32, 48)
        self.downsample_block1 = DownsampleModule(80)

        # 4 inception modules and 1 downsample module
        self.inception_block3 = InceptionModule(112, 48)
        self.inception_block4 = InceptionModule(96, 64)
        self.inception_block5 = InceptionModule(80, 80)
        self.inception_block6 = InceptionModule(48, 96)
        self.downsample_block2 = DownsampleModule(96)

        # 2 inception modules
        self.inception_block7 = InceptionModule(176, 160)
        self.inception_block8 = InceptionModule(176, 160)

        # average pooling
        self.avg_pool = tf.keras.layers.AveragePooling2D((7,7))

        # model tail
        self.flat = tf.keras.layers.Flatten()
        self.classfier = tf.keras.layers.Dense(num_classes, activation='softmax')

    def call(self, input_tensor, training=True, **kwargs):
        # forward pass
        x = self.conv_block(input_tensor)
        x = self.inception_block1(x)
        x = self.inception_block2(x)
        x = self.downsample_block1(x)

        x = self.inception_block3(x)
        x = self.inception_block4(x)
        x = self.inception_block5(x)
        x = self.inception_block6(x)
        x = self.downsample_block2(x)

        x = self.inception_block7(x)
        x = self.inception_block8(x)
        x = self.avg_pool(x)

        x = self.flat(x)
        return self.classfier(x)
The number of filters for each computational block is set according to the design of the model (see the diagram). After initializing all the blocks (in the __init__ function), we connect them according to the design (in the call function).
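As a quick sanity check of the wiring (a sketch, assuming a CIFAR-10-style 32 x 32 x 3 input, which is consistent with the 7 x 7 average pooling at the tail), we can build the model on a dummy batch, print the summary, and compile it:

import numpy as np

# instantiate and build the model by running one dummy batch through it
model = MiniInception(num_classes=10)
dummy = np.zeros((1, 32, 32, 3), dtype='float32')
_ = model(dummy, training=False)
model.summary()

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])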
I think you are way off in your estimate of the number of parameters needed. Think more like a few million, which is what you will get if you use transfer learning. You can struggle trying to build your own model if you wish, but you will probably not do any better (and more likely nowhere near as well) than the results you will get from transfer learning. I highly recommend the MobileNetV2 model. You can make that, or any of the other models, perform better if you use an adjustable learning rate with ReduceLROnPlateau (documentation here). The other thing I recommend is the Keras EarlyStopping callback (documentation here). Set it to monitor validation loss and set restore_best_weights=True. Set the number of epochs to a large number so this callback gets triggered and returns the model with the weights from the epoch with the lowest validation loss. My recommended code is shown below.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adamax

height = 224
width = 224
img_shape = (height, width, 3)
dropout = .3
lr = .001
class_count = 156  # number of classes

base_model = tf.keras.applications.MobileNetV2(include_top=False, input_shape=img_shape,
                                               pooling='max', weights='imagenet')
x = base_model.output
x = keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001)(x)
x = Dense(512, kernel_regularizer=regularizers.l2(l=0.016), activity_regularizer=regularizers.l1(0.006),
          bias_regularizer=regularizers.l1(0.006), activation='relu',
          kernel_initializer=tf.keras.initializers.GlorotUniform(seed=123))(x)
x = Dropout(rate=dropout, seed=123)(x)
output = Dense(class_count, activation='softmax',
               kernel_initializer=tf.keras.initializers.GlorotUniform(seed=123))(x)
model = Model(inputs=base_model.input, outputs=output)
model.compile(Adamax(learning_rate=lr), loss='categorical_crossentropy', metrics=['accuracy'])

rlronp = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1, verbose=1,
                                              mode='auto', min_delta=0.0001, cooldown=0, min_lr=0)
estop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0, patience=4,
                                         verbose=1, mode="auto", baseline=None,
                                         restore_best_weights=True)
callbacks = [rlronp, estop]
Also look at the balance in your dataset, that is, compare how many training samples you have for each class. If the ratio of the most common to the least common class is greater than 2 or 3, you may want to take action to mitigate that. Numerous methods are available; the simplest is to use the class_weight parameter in model.fit. To do that you need to create a class_weights dictionary. The process is outlined below.
Let's say your class distribution is:
class0 - 500 samples
class1 - 2000 samples
class2 - 1500 samples
class3 - 200 samples
Then your dictionary would be:
class_weights = {0: 2000/500, 1: 2000/2000, 2: 2000/1500, 3: 2000/200}
In model.fit, set class_weight=class_weights.
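Putting the pieces together, a hedged sketch of the final fit call (train_gen and valid_gen stand in for whatever data generators or arrays you are using):

class_weights = {0: 2000/500, 1: 2000/2000, 2: 2000/1500, 3: 2000/200}

history = model.fit(train_gen,
                    validation_data=valid_gen,
                    epochs=100,                  # large on purpose; EarlyStopping ends training
                    callbacks=callbacks,         # [rlronp, estop] from above
                    class_weight=class_weights,
                    verbose=1)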

tf.keras.layers.BatchNormalization with trainable=False appears to not update its internal moving mean and variance

I am trying to find out how exactly the BatchNormalization layer behaves in TensorFlow. I came up with the following piece of code, which to the best of my knowledge should be a perfectly valid Keras model; however, the mean and variance of BatchNormalization don't appear to be updated.
From the docs (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization):
in the case of the BatchNormalization layer, setting trainable = False on the layer means that the layer will be subsequently run in inference mode (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch).
I expect the model to return a different value with each subsequent predict call.
What I see, however, are the exact same values returned 10 times.
Can anyone explain to me why the BatchNormalization layer does not update its internal values?
import tensorflow as tf
import numpy as np

if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(3, 5) * 5 + 0.3

    bn = tf.keras.layers.BatchNormalization(trainable=False, epsilon=1e-9)
    z = input = tf.keras.layers.Input([5])
    z = bn(z)

    model = tf.keras.Model(inputs=input, outputs=z)

    for i in range(10):
        print(x)
        print(model.predict(x))
        print()
I use TensorFlow 2.1.0
Okay, I found the mistake in my assumptions: the moving averages are updated during training, not during inference as I thought. This makes perfect sense, as updating the moving averages during inference would likely result in an unstable production model (for example, a long sequence of highly pathological input samples, e.g. ones whose generating distribution differs drastically from the one the network was trained on, could bias the network and degrade performance on valid input samples).
The trainable parameter is useful when you're fine-tuning a pretrained model and want to freeze some of the layers of the network even during training. When you call model.predict(x) (or even model(x) or model(x, training=False)), the layer automatically uses the moving averages instead of batch statistics.
The code below demonstrates this clearly
import tensorflow as tf
import numpy as np

if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(10, 5) * 5 + 0.3

    z = input = tf.keras.layers.Input([5])
    z = tf.keras.layers.BatchNormalization(trainable=True, epsilon=1e-9, momentum=0.99)(z)

    model = tf.keras.Model(inputs=input, outputs=z)
    # a dummy loss function
    model.compile(loss=lambda x, y: (x - y) ** 2)
    # a dummy fit just to update the batchnorm moving averages
    model.fit(x, x, batch_size=3, epochs=10)

    # first predict uses the moving averages from training
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # outputs the same thing as previous predict
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # here calling the model with training=True results in update of moving averages
    # furthermore, it uses the batch mean and variance as in training,
    # so the result is very different
    pred = model(x, training=True).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # here we see again that the moving averages are used but they differ slightly after
    # the previous call, as expected
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
In the end, I found that the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) mentions this:
When performing inference using a model containing batch normalization, it is generally (though not always) desirable to use accumulated statistics rather than mini-batch statistics. This is accomplished by passing training=False when calling the model, or using model.predict.
Hopefully this will help someone with a similar misunderstanding in the future.

How can I improve my LSTM accuracy in Tensorflow

I'm trying to figure out how to decrease the error in my LSTM. It's an odd use case because rather than classifying, we are taking in short lists (up to 32 elements long) and outputting a series of real numbers ranging from -1 to 1, representing angles. Essentially, we want to reconstruct short protein loops from amino acid inputs.
In the past we had redundant data in our datasets, so the accuracy reported was incorrect. Since removing the redundant data our validation accuracy has gotten much worse, which suggests our network had learned to memorise the most frequent examples.
Our dataset is 10,000 items, split 70/20/10 between train, validation and test. We use a bidirectional LSTM as follows:
x = tf.cast(tf_train_dataset, dtype=tf.float32)
output_size = FLAGS.max_cdr_length * 4
dmask = tf.placeholder(tf.float32, [None, output_size], name="dmask")
keep_prob = tf.placeholder(tf.float32, name="keepprob")
sizes = [FLAGS.lstm_size, int(math.floor(FLAGS.lstm_size/2)), int(math.floor(FLAGS.lstm_size/4))]
single_rnn_cell_fw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_fw" + str(i)) for i in range(len(sizes))])
single_rnn_cell_bw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_bw" + str(i)) for i in range(len(sizes))])
length = create_length(x)
initial_state = single_rnn_cell_fw.zero_state(FLAGS.batch_size, dtype=tf.float32)
initial_state = single_rnn_cell_bw.zero_state(FLAGS.batch_size, dtype=tf.float32)
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=single_rnn_cell_fw, cell_bw=single_rnn_cell_bw, inputs=x, dtype=tf.float32, sequence_length = length)
output_fw, output_bw = outputs
states_fw, states_bw = states
output_fw = last_relevant(FLAGS, output_fw, length, "last_fw")
output_bw = last_relevant(FLAGS, output_bw, length, "last_bw")
output = tf.concat((output_fw, output_bw), axis=1, name='bidirectional_concat_outputs')
test = tf.placeholder(tf.float32, [None, output_size], name="train_test")
W_o = weight_variable([sizes[-1]*2, output_size], "weight_output")
b_o = bias_variable([output_size],"bias_output")
y_conv = tf.tanh(tf.matmul(output, W_o) * dmask, name="output")
Essentially, we use 3 layers of LSTM, with 256, 128 and 64 units each. We take the last step of both the forward and backward passes and concatenate them together. These feed into a final, fully connected layer that presents the data in the way we need it. We use a mask to set the steps we don't need to zero.
Our cost function uses a mask again, and takes the mean of the squared difference. We build the mask from the test data. Values to ignore are set to -3.0.
def cost(goutput, gtest, gweights, FLAGS):
    mask = tf.sign(tf.add(gtest, 3.0))
    basic_error = tf.square(gtest - goutput) * mask
    basic_error = tf.reduce_sum(basic_error)
    basic_error /= tf.reduce_sum(mask)
    return basic_error
To train the net I've used a variety of optimizers. The lowest scores have been obtained with AdamOptimizer. The others, such as Adagrad, Adadelta and RMSProp, tend to flatline around 0.3/0.4 error, which is not particularly great.
Our learning rate is 0.004, batch size of 200. We use a 0.5 probability dropout layer.
I've tried adding more layers, changing learning rates, batch sizes, even the representation of the data. I've attempted batch regularisation, L1 and L2 weight regularisation (though perhaps incorrectly) and I've even considered switching to a convnet approach instead.
Nothing seems to make any difference. What has seemed to work is changing the optimizer. Adam seems noisier as it improves, but it does get closer than the other optimizers.
We need to get down to a value much closer to 0.05 or 0.01. Sometimes the training error touches 0.09 but the validation doesn't follow. I've run this network for about 500 epochs so far (about 8 hours) and it tends to settle around 0.2 validation error.
I'm not quite sure what to attempt next. A decayed learning rate might help, but I suspect there is something more fundamental I need to do. It could be something as simple as a bug in the code; I need to double-check the masking.

Strange sequence classification performance after shuffling sequence elements

I have one million sequences I'm trying to classify as either 0 or 1. The outcome is fairly well balanced (class 0: 70%, class 1: 30%). The maximum sequence length is 50, and I've post-padded my sequences with zeroes. There are 100 unique sequence symbols, and the embedding length is 30. It's an LSTM NN trained on two outputs (one is the main output node, and the other comes right after the LSTM). The code is below.
As a sanity check, I ran three versions of this: one in which I randomize the outcome labels (I expected terrible performance), another where the labels are correct but I randomize the order of events within each sequence (I also expected bad performance), and finally one where nothing is shuffled (I expected good performance).
Instead I found the following:
Shuffled labels: Accuracy = 69.5% (Model predicts every sequence is class 0)
Shuffled sequence symbols: Accuracy = 88%!
Nothing is shuffled: Accuracy = 90%
What do you make of this? All I can think of is that there is little signal to be gained from analyzing the order of the sequences, and that most of the signal comes from the presence or absence of particular symbols in the sequence. Maybe RNNs and LSTMs are overkill here?
# Input 1: event type sequences
# Take the event integer sequences, run them through an embedding layer to get float vectors, then run through LSTM
main_input = Input(shape =(max_seq_length,), dtype = 'int32', name = 'main_input')
x = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(main_input)
lstm_out = LSTM(32)(x)
# Auxiliary loss here from first input
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)
# An arbitrary number of dense, hidden layers here
x = Dense(64, activation='relu')(lstm_out)
# The main output node
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
## Compile and fit the model
model = Model(inputs=[main_input], outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'], loss_weights=[1., 0.2])
print(model.summary())
np.random.seed(21)
model.fit([train_X1], [train_Y, train_Y], epochs=1, batch_size=200)
Assuming you've played around with the size of the LSTM, your conclusion seems reasonable. Beyond that, it's hard to say, as it depends on what the dataset is. For example, it could be that shorter sequences are more unpredictable, and if most of your sequences are short, then this would support the conclusion as well.
It's also worth trying to truncate your sequences in length, to say the first 25 entries.
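For instance, a small hedged sketch of that truncation with Keras' pad_sequences (raw_sequences here is a hypothetical name for your list of integer-encoded sequences):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# keep only the first 25 symbols of each sequence; shorter ones are post-padded with 0
train_X1_truncated = pad_sequences(raw_sequences, maxlen=25,
                                   padding='post', truncating='post')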