Tensorflow GPU Memory exhausted during mean squared error - tensorflow

I have a TensorFlow model which is a recurrent neural network using long short-term memory (LSTM). The state size is 3000, each time step has 300 inputs, there are about 500 time steps, and there is one output for each time step. I am training a sequence-to-sequence model.
It runs fine for inputs with fewer than 500 time steps, but somewhere around 500 time steps it crashes with the following out-of-memory error:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[20375,20375]
[[Node: gradients/mean_squared_error/Mul_grad/mul_1 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](mean_squared_error/Square, gradients/mean_squared_error/Sum_grad/Tile)]]
[[Node: gradients/MatMul_grad/tuple/control_dependency_1/_225 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_5086_gradients/MatMul_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
This is running on a GPU with 12 GB of memory.
I have tried running it on my laptop CPU, and it seems to use very little memory (about 1 to 2 GB), but it is so slow that it never reached 500 time steps. I'm working on some changes that will make it skip ahead to 500 time steps to see how much memory it uses when not running on a GPU.
My question is: where could TensorFlow possibly want to allocate a tensor of shape [20375, 20375]? It seems to be related to tf.losses.mean_squared_error, but that doesn't seem like an operation that should require such an exorbitant amount of memory.
I have tried reducing the batch size, but that just pushes the failure point out by a few more time steps, and I'll need up to a few thousand time steps, so this doesn't seem like a good long-term solution. I'd prefer to get to the root of the problem.
Here is the relevant code for the mean squared error:
initial_state_tuple = tf.contrib.rnn.LSTMStateTuple(initial_state, initial_hidden_state)

# Create the actual RNN
with tf.variable_scope(VARIABLE_SCOPE, reuse=None):
    cell = tf.contrib.rnn.BasicLSTMCell(STATE_SIZE)
    rnn_outputs, finalstate = tf.nn.dynamic_rnn(cell=cell, inputs=networkinput,
                                                initial_state=initial_state_tuple)

with tf.variable_scope(VARIABLE_SCOPE, reuse=True):
    weights = tf.get_variable(name=WEIGHTS_NAME, shape=[STATE_SIZE, 1], dtype=tf.float32)
    biases = tf.get_variable(name=BIASES_NAME, shape=[1], dtype=tf.float32)

# Build the output layers
rnn_outputs_reshaped = tf.reshape(rnn_outputs, [-1, STATE_SIZE])
network_outputs = tf.sigmoid(tf.matmul(rnn_outputs_reshaped, weights) + biases)
expected_outputs_reshaped = tf.reshape(expected_outputs, [-1, 1])

# Loss mask just cancels out the inputs that are padding characters,
# since not all inputs have the same number of time steps
loss_mask_reshaped = tf.reshape(loss_mask, shape=[-1])
expected_outputs_reshaped = loss_mask_reshaped * expected_outputs_reshaped
network_outputs = loss_mask_reshaped * network_outputs

loss = tf.losses.mean_squared_error(labels=expected_outputs_reshaped, predictions=network_outputs)
If you want all of the code, it can be found here. The relevant functions are buildtower() and buildgraph(). The constants NUM_GPUS and BATCH_SIZE are set to appropriate values when running on the machine with the GPUs.
Update: I replaced the line
loss = tf.losses.mean_squared_error(labels=expected_outputs_reshaped, predictions=network_outputs)
with
error_squared = tf.pow(expected_outputs_reshaped - network_outputs, 2)
loss = tf.reduce_mean(error_squared)
and the same error happened. I reduced the state size to 30 and the batch size to 5, and the error still happened, although it did make it up to about 3000 time steps.
Update: After doing some research, I have found that, when training an RNN with a large number of time steps, truncated backpropagation is often used. This leads me to believe that backpropagation through a large number of time steps inherently takes a lot of memory, and my issue is not that I've constructed my graph wrong, but that I have a fundamental misunderstanding of the resource requirements of gradient calculations. To this end, I am working on changing my code to use truncated backpropagation. I will report back with results.

This project is my first experience with machine learning and Tensorflow, and after doing some research, it seems I had some fundamental misunderstandings.
I had thought that memory usage would scale linearly with the number of time steps in my data. Because every other dimension of my model (batch size, state size) was small, I expected that I could get up to quite a few time steps before running out of memory. However, it seems that the memory usage of computing the gradients grows far faster with the number of time steps than I expected, so no matter how small I made the state size and batch size, it would eventually exhaust all my memory because of the large number of time steps.
To deal with this, I am using truncated backpropagation, in which each batch is broken up into chunks of some fixed number of time steps. This is not perfect, because it means that errors can only be propagated back at most that many time steps. However, based on what I've found online, it seems to work well enough, and there aren't many other ways around the memory usage issue. A rough sketch of what I mean is below.
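Here is a minimal sketch of the chunked training loop I have in mind, assuming initial_state and initial_hidden_state are the state placeholders used above and finalstate is the LSTMStateTuple returned by dynamic_rnn; TRUNCATE_LEN and the helper name are illustrative, not my actual code (the loss_mask placeholder would need the same per-chunk slicing):

import numpy as np

TRUNCATE_LEN = 100   # assumed chunk length; needs tuning
STATE_SIZE = 3000

def run_truncated_epoch(sess, train_op, loss, finalstate,
                        networkinput, expected_outputs,
                        initial_state, initial_hidden_state,
                        full_inputs, full_targets):
    # full_inputs: [batch, time, 300], full_targets: [batch, time, 1]
    batch = full_inputs.shape[0]
    c = np.zeros((batch, STATE_SIZE), dtype=np.float32)
    h = np.zeros((batch, STATE_SIZE), dtype=np.float32)
    total_loss = 0.0
    for start in range(0, full_inputs.shape[1], TRUNCATE_LEN):
        chunk_x = full_inputs[:, start:start + TRUNCATE_LEN]
        chunk_y = full_targets[:, start:start + TRUNCATE_LEN]
        # The final state of one chunk becomes the initial state of the next,
        # but no gradients flow across the chunk boundary.
        (c, h), chunk_loss, _ = sess.run(
            [finalstate, loss, train_op],
            feed_dict={networkinput: chunk_x,
                       expected_outputs: chunk_y,
                       initial_state: c,
                       initial_hidden_state: h})
        total_loss += chunk_loss
    return total_loss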
As I said before, this is all my first experience with machine learning, so if anything in here is blatantly wrong, please tell me.

Related

Unstable loss in binary classification for time-series data - extremely imbalanced dataset

I am working on a deep learning model to detect regions of timesteps with anomalies. The model should classify each timestep as containing an anomaly or not.
My labels are something like this:
labels = [0 0 0 1 0 0 0 0 1 0 0 0 ...]
The 0s represent 'normal' timesteps and the 1s represent the existence of an anomaly. In reality, my dataset is very imbalanced:
My training set consists of over 7000 samples, of which only 1400 (20%) contain at least one anomaly (a timestep labeled 1).
I am feeding samples with 4096 timesteps each. The average number of anomalies, in the samples that contain them, is around 2. So, assuming there is an anomaly, the percentage of anomalous timesteps ranges from 0.02% to 0.04% for each sample.
With that said, I need to shift from standard binary cross entropy to something that emphasizes the anomalous timesteps over the anomaly-free timesteps.
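For concreteness, what I mean is roughly a per-timestep weighted binary cross entropy along these lines (a minimal sketch; the weight value and the model object are placeholders, not my exact code):

from tensorflow import keras
K = keras.backend

def weighted_bce(pos_weight):
    # Per-timestep binary cross entropy that up-weights the rare anomalous (1) steps.
    def loss(y_true, y_pred):
        bce = K.binary_crossentropy(y_true, y_pred)    # shape (batch, timesteps, 1)
        weights = 1.0 + (pos_weight - 1.0) * y_true    # pos_weight on 1s, 1.0 on 0s
        return K.mean(bce * weights)
    return loss

# model.compile(optimizer="adam", loss=weighted_bce(50.0))  # 50.0 is just an example weight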
So, I experimented with adding weights to the anomalous class so that the model is forced to learn from the anomalies rather than just reducing its loss on the anomaly-free timesteps. It actually worked well and the model seems to learn to detect anomalous timesteps. One problem, however, is that training can become quite unstable (and unpredictable), with sudden loss spikes appearing and affecting the learning process. Below, you can see the effects on the loss and metric charts for two of my training runs:
After going through a debugging process for the trainings, I am confident that the problem comes from occasional predictions given for the anomalous timesteps. That is, in some samples of a certain epoch, and in some anomalous timesteps, the model gives a very low prediction, e.g. 0.01, for the 1s label (which should of course be close to 1). Considering the very high (but supposedly necessary) weights given to the anomalous timesteps, the penalty is really extreme and the loss just skyrockets.
Going deeper, if I inspect the losses of the sample where the jump happened and look at the batch right before the loss jumped, I see that the losses are all around 10^-2 (0.0053, 0.004, 0.0041, ...), with not a single one above those values and an average loss of about 0.005. However, if I inspect the loss of the following batch in that same sample, the average loss of the batch is already 3.6, with part of the examples at a low loss but another part at a very high loss (e.g. 9.2, 7.7, 8.9, ...). I can confirm that all the high losses come from the penalties incurred when predicting the 1s timesteps. The following batches of the same sample, and some of the batches of the next epoch, get affected and take some time to start decreasing again and return to a stable learning process.
With this said, I have had this problem for some weeks already and really need some guidance on what I could try to deal with the spikes, which I assume arise from the gradient updates associated with anomalous timesteps that are harder to learn.
I am currently using a simple 2-layer Keras LSTM model with 64 units each, followed by a final 1-unit dense layer with sigmoid activation. As the optimizer I am using Adam, and I am training with batch size 128. Some things to consider as well:
I have tried changing the weights and other loss functions. Ultimately, if I reduce the weights given to the anomalous timesteps, the model doesn't give them as much importance and reduces its loss by considering only the anomaly-free timesteps. I have also considered focal binary cross entropy, but it doesn't seem to do anything that could avoid those jumps since, in the end, it is all about adding or reducing weights for certain timesteps.
My current learning rate is Adam's default, 10⁻³. I have tried reducing the learning rate, which leads to less severe spikes (they're still there though), but the model also takes much longer or gets stuck. I am not sure this is the way to go here, as the training seems to go well except for these cases. A decaying learning rate might also not make much sense, as the spikes can happen early in training and not only in later epochs.
I am still investigating gradient clipping as a solution. I am not sure what values to use or whether it is actually effective for my case, but from what I understand it should help counter the jumps caused by those 'almost' exploding gradients (see the sketch after this list).
The spikes could originate from sample noise / bad samples. However, since I am already using batch size 128, and the spikes were still there when I trained on simple synthetic samples I created myself, I don't think the problem lies with specific samples.
The imbalance obviously plays the bigger role here. I am not sure whether undersampling the majority class of 4096-timestep samples (e.g. increasing from 20% to 50% the proportion of samples with at least one anomalous timestep) would make a big difference, since each sample is itself very imbalanced, containing only around 2 anomalous timesteps. The problem is the imbalance within each sample.
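The kind of clipping I am considering looks roughly like this (a hedged sketch; the clipnorm threshold is a guess and would need tuning, and weighted_bce refers to the loss sketch above):

from tensorflow import keras

# clipnorm caps the norm of each gradient tensor before Adam applies the update,
# which should blunt the rare, huge gradients coming from the anomalous timesteps.
opt = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
# model.compile(optimizer=opt, loss=weighted_bce(50.0))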
I know this is quite a lot of context, but honestly I have been trying things for weeks and am at my limit.
The solutions I am inclined to try next are either gradient clipping or changing my samples to be more centered around the anomalous timesteps, so that they contain fewer anomaly-free timesteps and hopefully allow convergence without having to apply such drastic weights to the anomalous timesteps. This last option is harder for me to opt for due to some restrictions, but I might look at it if I have nothing else available.
What do you think? I am able to provide more information if needed.

How to proceed training after stopping it and changing some parameters?

I'm training my model via model.fit() in Keras. I stopped the training, either by interrupting it or because it finished, then changed the batch_size and decided to continue training. Here is what happened:
The loss when the training was stopped/finished = 26
The loss when the training proceeded = 46
Meaning that I lost all the progress I made and it is as if I'm starting over.
It does proceed from where it left off only if I don't change anything. But if I change the batch size, it is as if the optimizer re-initializes my weights and throws out my progress. How can I get a handle on what the optimizer is doing without my consent?
You most likely have some examples that give you large loss values; MSE makes this worse. When the batch size is larger, you are probably getting more of these outliers in your batch. You can look at the examples that contribute the most loss.
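For example, something along these lines would rank the training examples by their individual MSE (a rough sketch; model, x_train and y_train stand in for your own objects):

import numpy as np

# Per-example mean squared error over the training set.
preds = model.predict(x_train, batch_size=256).reshape(len(y_train), -1)
targets = np.asarray(y_train).reshape(len(y_train), -1)
per_example_mse = np.mean((preds - targets) ** 2, axis=1)
worst = np.argsort(per_example_mse)[::-1][:10]   # 10 highest-loss examples
print(list(zip(worst, per_example_mse[worst])))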

Added data reducing accuracy

I'm running into a situation where giving a neural network extra data reduces accuracy, and I can't see how that's possible.
Suppose you train a neural network - just a binary classifier - on a set of examples that have, say, 10 variables each. And it learns to classify both training and tests sets quite accurately. Then rerun with the same examples but extra variables on each example, say an extra 20 variables each. Maybe the extra variables don't give as good a signal as the original ones, but it's still getting the original variables too. Worst-case scenario, it should just take a bit more time learning to ignore the extra variables, right? On the face of it, there shouldn't be any way for the accuracy to be less?
To go through everything I can think of:
It's the same set of records in each case.
All the original variables are still there, just with some extra ones added.
It's not about overfitting; the network trained with the extra data is much less accurate on both the training and test sets.
I don't think it's about needing more time. It's been running for a long time now and showing no signs of making progress.
I've tried with the learning rate both unchanged and reduced, same result each way.
Using TensorFlow, simple feedforward network with one hidden layer, Adam optimizer. Code is at https://github.com/russellw/tf-examples/blob/master/multilayer.py and the most important section is
# Inputs and outputs
X = tf.placeholder(dtype, shape=(None, cols))
Y = tf.placeholder(dtype, shape=None,)
# Hidden layers
n1 = 3
w1 = tf.Variable(rnd((cols, n1)), dtype=dtype)
b1 = tf.Variable(rnd(n1), dtype=dtype)
a1 = tf.nn.sigmoid(tf.matmul(X, w1) + b1)
pr('layer 1: {}', n1)
# Output layer
no = 1
wo = tf.Variable(rnd((n1, no)), dtype=dtype)
bo = tf.Variable(rnd(no), dtype=dtype)
p = tf.nn.sigmoid(tf.squeeze(tf.matmul(a1, wo)) + bo)
tf.global_variables_initializer().run()
# Model
cost = tf.reduce_sum((p-Y)**2/rows)
optimizer = tf.train.AdamOptimizer(args.learning_rate).minimize(cost)
tf.global_variables_initializer().run()
How is it possible for the extra data to make the network less accurate? What am I missing?
You are confusing variables with (training) data. Variables are what the network learns from the data, such as weights and biases. By increasing the number of variables you are increasing the number of trainable units, which obviously requires more time. After a certain threshold, your data might become insufficient for the network to learn/update these variables.
Extra data simply means more examples.
So in your case it seems that you've crossed that threshold (assuming you've waited long enough for your NN to learn before concluding that it is not learning anymore).
You may want to look into a phenomenon called the curse of dimensionality. From Wikipedia:
The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
https://en.wikipedia.org/wiki/Curse_of_dimensionality
It turns out to have at least something to do with the properties of the optimizer. Though Adam worked best of all the optimizers in one case, in a slightly different case it fails badly where Ftrl solves the problem. I don't know why Adam has this failure mode, but my current solution is to make the optimizer a parameter and use a batch file to loop through all of them.
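A rough sketch of what parameterizing the optimizer could look like in the TF 1.x style used above (the flag names and optimizer list are my own choices, not taken from the repo):

import argparse
import tensorflow as tf

# Map a command-line name to a tf.train optimizer class.
OPTIMIZERS = {
    "adam": tf.train.AdamOptimizer,
    "ftrl": tf.train.FtrlOptimizer,
    "sgd": tf.train.GradientDescentOptimizer,
    "rmsprop": tf.train.RMSPropOptimizer,
}

parser = argparse.ArgumentParser()
parser.add_argument("--optimizer", choices=OPTIMIZERS, default="adam")
parser.add_argument("--learning_rate", type=float, default=0.01)
args = parser.parse_args()

# cost is the loss tensor built earlier in the graph:
# optimizer = OPTIMIZERS[args.optimizer](args.learning_rate).minimize(cost)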

TensorFlow Increase Batch Size mid experiment

I am looking to replicate some of the behavior in the paper "Don't Decay the Learning Rate, Increase the Batch Size", and I am wondering if there is a simple approach to increasing the batch size within a GCMLE experiment. I have a custom estimator and I am trying to think of ways to adjust the batch size within the experiment. I realize that I could run with one batch size for a certain number of epochs, then load this saved graph and kick off a subsequent experiment, but I am wondering if there are any other options to update the batch size within the same experiment?
Setting up your graph to support a variable batch size is pretty easy: just use None for the first dimension of the shape. Take a look at this article:
Build a graph that works with variable batch size using Tensorflow
Then you can feed in a batch of any size at every sess.run(train_op, feed_dict={X: data, Y: labels}), where the first dimension of X, your batch, is of variable length.
It pretty much just works as you'd expect.
Example graph structure with a variable batch size:
X = tf.placeholder("float", [None, num_input])
Y = tf.placeholder("float", [None, num_classes])
In general, you're allowed to have 1 unknown dimension in your tensors. Tensorflow will infer that dimension based on the actual data you pass it at runtime.
In this example, in your first iterations your data shape might be [10, 784] (batches of 10), and in later iterations maybe your shape becomes [50, 784] (batches of 50). The rest of your graph setup will work without change.
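A self-contained sketch of that idea in TF 1.x (the layer sizes and the random data are made up for illustration):

import numpy as np
import tensorflow as tf

# The first dimension of both placeholders is None, so any batch size works.
num_input, num_classes = 784, 10
X = tf.placeholder("float", [None, num_input])
Y = tf.placeholder("float", [None, num_classes])
logits = tf.layers.dense(X, num_classes)
loss = tf.losses.softmax_cross_entropy(Y, logits)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch_size in (10, 50):            # "increase the batch size" mid-run
        data = np.random.rand(batch_size, num_input).astype("float32")
        labels = np.eye(num_classes)[np.random.randint(num_classes, size=batch_size)].astype("float32")
        sess.run(train_op, feed_dict={X: data, Y: labels})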
One approach is to set delay_workers_by_global_step=True in the constructor to Experiment.
The reason this works is because the effective batch size is batch_size * num_workers. So if you delay the start of other workers, your batch size will gradually increase.
Of course, your throughput will be correspondingly lower in the early phases.
If you directly want to control the batch_size, you will have to effectively replicate the behavior of learn_runner.run in your own code. That wouldn't be too bad, except for the fact that deep down in experiment.py, it starts a server which, AFAICT, cannot be disabled.

[MXNet] Periodic loss value when training with "step" learning rate policy

When training a deep CNN, a common approach is to use SGD with momentum and a "step" learning rate policy (e.g. the learning rate is set to 0.1, 0.01, 0.001, ... at different stages of training). But I encountered an unexpected phenomenon when training with this strategy under MXNet:
the training loss value is periodic.
https://user-images.githubusercontent.com/26757001/31327825-356401b6-ad04-11e7-9aeb-3f690bc50df2.png
The above is the training loss at a fixed learning rate of 0.01, where the loss decreases normally.
https://user-images.githubusercontent.com/26757001/31327872-8093c3c4-ad04-11e7-8fbd-327b3916b278.png
However, at the second stage of training (with lr 0.001), the loss goes up and down periodically, and the period is exactly one epoch.
So I thought it might be a problem with data shuffling, but that cannot explain why it doesn't happen in the first stage. I actually use ImageRecordIter as the DataIter and reset it after every epoch; is there anything I missed or set mistakenly?
train_iter = mx.io.ImageRecordIter(
    path_imgrec=recPath,
    data_shape=dataShape,
    batch_size=batchSize,
    last_batch_handle='discard',
    shuffle=True,
    rand_crop=True,
    rand_mirror=True)
The code for training and loss evaluation:
while True:
    train_iter.reset()
    for i, databatch in enumerate(train_iter):
        globalIter += 1
        mod.forward(databatch, is_train=True)
        mod.update_metric(metric, databatch.label)
        if globalIter % 100 == 0:
            loss = metric.get()[1]
            metric.reset()
        mod.backward()
        mod.update()
Actually the loss can converge, but it takes too long.
I've suffered from this problem for a long time, on different networks and different datasets.
I didn't have this problem when using Caffe. Is this due to an implementation difference?
Your loss/learning curves look suspiciously smooth, and I believe you can observe the same oscillation in the loss even when the learning rate is set to 0.01, just at a smaller relative scale (i.e. if you 'zoomed in' on the chart you'd see the same pattern). You may have an issue with your data iterator passing the same batch, for example. Your training loop also looks wrong, but this could be due to formatting (e.g. performing mod.update() only every 100 batches would not be correct).
You can observe periodicity in your loss when you're traveling across a valley in the loss surface, up and down the sides rather than down the valley. Choosing a lower learning rate can help fix this, and make sure you are using momentum too.
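In MXNet's Module API, switching to the lower rate while keeping momentum could look roughly like this (a hedged sketch that reuses mod from the question; the exact values are only illustrative):

# Re-initialize the optimizer for the second stage with a smaller learning
# rate, keeping momentum turned on.
mod.init_optimizer(optimizer='sgd',
                   optimizer_params={'learning_rate': 0.0005,
                                     'momentum': 0.9,
                                     'wd': 1e-4},
                   force_init=True)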