Tensorflow: OOM when batch size too large

Tensorflow: OOM when batch size too large - tensorflow

My script is failing due to too high memory usage. When I reduce the batch size it works.
#tf.function(autograph=not DEBUG)
def step(prev_state, input_b):
input_b = tf.reshape(input_b, shape=[1,input_b.shape[0]])
state = FastALIFStateTuple(v=prev_state[0], z=prev_state[1], b=prev_state[2], r=prev_state[3])
new_b = self.decay_b * state.b + (tf.ones(shape=[self.units],dtype=tf.float32) - self.decay_b) * state.z
thr = self.thr + new_b * self.beta
z = state.z
i_in = tf.matmul(input_b, W_in)
i_rec = tf.matmul(z, W_rec)
i_t = i_in + i_rec
I_reset = z * thr * self.dt
new_v = self._decay * state.v + (1 - self._decay) * i_t - I_reset
# Spike generation
is_refractory = tf.greater(state.r, .1)
zeros_like_spikes = tf.zeros_like(z)
new_z = tf.where(is_refractory, zeros_like_spikes, self.compute_z(new_v, thr))
new_r = tf.clip_by_value(state.r + self.n_refractory * new_z - 1,
0., float(self.n_refractory))
return [new_v, new_z, new_b, new_r]
#tf.function(autograph=not DEBUG)
def evolve_single(inputs):
accumulated_state = tf.scan(step, inputs, initializer=state0)
Z = tf.squeeze(accumulated_state[1]) # -> [T,units]
if self.model_settings['avg_spikes']:
Z = tf.reshape(tf.reduce_mean(Z, axis=0), shape=(1,-1))
out = tf.matmul(Z, W_out) + b_out
return out # - [BS,Num_labels]
# # - Using a simple loop
# out_store = []
# for i in range(fingerprint_3d.shape[0]):
# out_store.append(tf.squeeze(evolve_single(fingerprint_3d[i,:,:])))
# return tf.reshape(out_store, shape=[fingerprint_3d.shape[0],self.d_out])
final_out = tf.squeeze(tf.map_fn(evolve_single, fingerprint_3d)) # -> [BS,T,self.units]
return final_out
This code snippet is inside a tf.function, but I omitted it since I don't think it's relevant.
As can be seen, I run the code on fingerprint_3d, a tensor that has the dimension [BatchSize,Time,InputDimension], e.g. [50,100,20]. When I run this with BatchSize < 10 everything works fine, although tf.scan already uses a lot of memory for that.
When I now execute the code on a batch of size 50, suddenly I get an OOM, even though I am executing it in an iterative matter (here commented out).
How should I execute this code so that the Batch Size doesn't matter?
Is tensorflow maybe parallelizing my for loop so that it executed over multiple batches at once?
Another unrelated question is the following: What function instead of tf.scan should I use if I only want to accumulate one state variable, compared to the case for tf.scan where it just accumulates all the state variables? Or is that possible with tf.scan?

As mentioned in the discussions here, tf.foldl, tf.foldr, and tf.scan all require keeping track of all values for all iterations, which is necessary for computations like gradients. I am not aware of any ways to mitigate this issue; still, I would also be interested if anyone has a better answer than mine.

When I used
#tf.function
def get_loss_and_gradients():
with tf.GradientTape(persistent=False) as tape:
logits, spikes = rnn.call(fingerprint_input=graz_dict["train_input"], W_in=W_in, W_rec=W_rec, W_out=W_out, b_out=b_out)
loss = loss_normal(tf.cast(graz_dict["train_groundtruth"],dtype=tf.int32), logits)
gradients = tape.gradient(loss, [W_in,W_rec,W_out,b_out])
return loss, logits, spikes, gradients
it works.
When I remove the #tf.function decorator the memory blows up. So it really seems important that tensorflow can create a graph for you computations.

Related

How to minimise a multivariate cost function in Julia with Optim?

I am currently stuck trying to utilize the Optim package in Julia in an attempt to minimize a cost function. The cost function is the cost function for an L2 regularised logistic regression. It is constructed as follows;
using Optim
function regularised_cost(X, y, θ, λ)
m = length(y)
# Sigmoid predictions
h = sigmoid(X * θ)
# left side of the cost function
positive_class_cost = ((-y)' * log.(h))
# right side of the cost function
negative_class_cost = ((1 .- y)' * log.(1 .- h))
# lambda effect
lambda_regularization = (λ/(2*m) * sum(θ[2 : end] .^ 2))
# Current batch cost
𝐉 = (1/m) * (positive_class_cost - negative_class_cost) + lambda_regularization
# Gradients for all the theta members with regularization except the constant
∇𝐉 = (1/m) * (X') * (h-y) + ((1/m) * (λ * θ))
∇𝐉[1] = (1/m) * (X[:, 1])' * (h-y) # Exclude the constant
return (𝐉, ∇𝐉)
end
I would like to use LBFGS algorithm as a solver to find the best weights that minimize this function based on my training examples and labels which are defined as:
opt_train = [ones(size(X_train_scaled, 1)) X_train_scaled] # added intercept
initial_theta = zeros(size(opt_train, 2))
Having read the documentation, here's my current implementation which is currently not working:
cost, gradient! = regularised_cost(opt_train, y_train, initial_theta, 0.01)
res = optimize(regularised_cost,
gradient!,
initial_theta,
LBFGS(),
Optim.Options(g_tol = 1e-12,
iterations = 1000,
store_trace = true,
show_trace = true))
How do I pass my training examples and labels along with the gradients so that the solver (LBFGS) can find me the best weights for theta?

You need to close over your train data and make a loss function that only takes the parameters as inputs.
As per the docs on dealing with constant parameterised
It should be so.wthing like:
loss_and_grad(theta) = regularised_cost(opt_train, y_train, theta, 0.01)
loss(theta) = first(loss_and_grad(theta))
res = optimize(loss, initial_theta)
I will leave it to you to see how to hook the gradient in.
A reminder though: don't use non-const globals.
They are slow, in particular the way they are used in the loss_and_grad function I wrote will be slow.
So you should declare opt_train and y_train as const.
Or make a function that takes them and returns a loss function etc

tensoflow sparse tensor efficient batching

I have a method (shown below) that gets a batch from a tensorflow SparseTensorValue. However, this method is rather slow (10-20 seconds for a batch of size 32), which is problematic because it's called thousands of times.
def get_batch(index, tensors, batch_size, nItems):
xs, ys = tensors
begin = (index * batch_size)
end = min((index+1)*batch_size, nItems)
y_b = ys[begin:end]
(inds, vals, dsize) = xs
nInds = [[ind[0] - begin, ind[1]] for ind in inds if begin <= ind[0] < end]
nInds = np.array(nInds)
nVals = vals[:nInds.shape[0]]
nDsize = (end - begin, dsize[1])
x_b = tf.SparseTensorValue(nInds, nVals, nDsize)
return (x_b, y_b)
Is there a way to make this method more efficient?

I'd recommend you to write your input pipeline using tf.data instead, then if anything you can offload this rebatching to another core and not block your main thread.

consistent forward / backward pass with tensorflow dropout

For the reinforcement learning one usually applies forward pass of the neural network for each step of the episode in order to calculate policy. Afterwards one could calculate parameter gradients using backpropagation. Simplified implementation of my network looks like this:
class AC_Network(object):
def __init__(self, s_size, a_size, scope, trainer, parameters_net):
with tf.variable_scope(scope):
self.is_training = tf.placeholder(shape=[], dtype=tf.bool)
self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
# (...)
layer = slim.fully_connected(self.inputs,
layer_size,
activation_fn=tf.nn.relu,
biases_initializer=None)
layer = tf.contrib.layers.dropout(inputs=layer, keep_prob=parameters_net["dropout_keep_prob"],
is_training=self.is_training)
self.policy = slim.fully_connected(layer, a_size,
activation_fn=tf.nn.softmax,
biases_initializer=None)
self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)
actions_onehot = tf.one_hot(self.actions, a_size, dtype=tf.float32)
responsible_outputs = tf.reduce_sum(self.policy * actions_onehot, [1])
self.policy_loss = - policy_loss_multiplier * tf.reduce_mean(tf.log(responsible_outputs) * self.advantages)
local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
self.gradients = tf.gradients(self.policy_loss, local_vars)
Now during training I will fist rollout the episode by consecutive forward passes (again, simplified version):
s = self.local_env.reset() # list of input variables for the first step
while done == False:
a_dist = sess.run([self.policy],
feed_dict = {self.local_AC.inputs: [s],
self.is_training: True})
a = np.argmax(a_dist)
s, r, done, extra_stat = self.local_env.step(a)
# (...)
and in the end I will calculate gradients by backward pass:
p_l, grad = sess.run([self.policy_loss,
self.gradients],
feed_dict={self.inputs: np.vstack(comb_observations),
self.is_training: True,
self.actions: np.hstack(comb_actions),})
(please note that I could have made a mistake somewhere above trying to remove as much as possible of the original code irrelevant to the issue in question)
So finally the question: Is there a way of ensuring that all the consecutive calls to the sess.run() will generate the same dropout structure? Ideally I would like to have exactly the same dropout structure within each episode and only change it between episodes. Things seem to work well as they are but I continue to wonder.

RNN Slow-down phenomenon of Tensorflow

I found a peculiar property of lstm cell(not limited to lstm but I only examined with this) of tensorflow which has not been reported as far as I know.
I don't know whether it actually has, so I left this post in SO. Below is a toy code for this problem:
import tensorflow as tf
import numpy as np
import time
def network(input_list):
input,init_hidden_c,init_hidden_m = input_list
cell = tf.nn.rnn_cell.BasicLSTMCell(256, state_is_tuple=True)
init_hidden = tf.nn.rnn_cell.LSTMStateTuple(init_hidden_c, init_hidden_m)
states, hidden_cm = tf.nn.dynamic_rnn(cell, input, dtype=tf.float32, initial_state=init_hidden)
net = [v for v in tf.trainable_variables()]
return states, hidden_cm, net
def action(x, h_c, h_m):
t0 = time.time()
outputs, output_h = sess.run([rnn_states[:,-1:,:], rnn_hidden_cm], feed_dict={
rnn_input:x,
rnn_init_hidden_c: h_c,
rnn_init_hidden_m: h_m
})
dt = time.time() - t0
return outputs, output_h, dt
rnn_input = tf.placeholder("float", [None, None, 512])
rnn_init_hidden_c = tf.placeholder("float", [None,256])
rnn_init_hidden_m = tf.placeholder("float", [None,256])
rnn_input_list = [rnn_input, rnn_init_hidden_c, rnn_init_hidden_m]
rnn_states, rnn_hidden_cm, rnn_net = network(rnn_input_list)
feed_input = np.random.uniform(low=-1.,high=1.,size=(1,1,512))
feed_init_hidden_c = np.zeros(shape=(1,256))
feed_init_hidden_m = np.zeros(shape=(1,256))
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10000):
_, output_hidden_cm, deltat = action(feed_input, feed_init_hidden_c, feed_init_hidden_m)
if i % 10 == 0:
print 'Running time: ' + str(deltat)
(feed_init_hidden_c, feed_init_hidden_m) = output_hidden_cm
feed_input = np.random.uniform(low=-1.,high=1.,size=(1,1,512))
[Not important]What this code does is to generate an output from 'network()' function containing LSTM where the input's temporal dimension is 1, so output's is also 1, and pull in&out initial state for each step of running.
[Important] Looking the 'sess.run()' part. For some reasons in my real code, I happened to put [:,-1:,:] for 'rnn_states'. What is happening is then the time spent for each 'sess.run()' increases. For some inspection by my own, I found this slow down stems from that [:,-1:,:]. I just wanted to get the output at the last time step. If you do 'outputs, output_h = sess.run([rnn_states, rnn_hidden_cm], feed_dict{~' w/o [:,-1:,:] and take 'last_output = outputs[:,-1:,:]' after the 'sess.run()', then the slow down does not occur.
I do not know why this exponential increment of time happens with that [:,-1:,:] running. Is this the nature of tensorflow hasn't been documented but particularly slows down(may be adding more graph by its own?)?
Thank you, and hope this mistake not happen for other users by this post.

I encountered the same problem, with TensorFlow slowing down for each iteration I ran it, and found this question while trying to debug it. Here's a short description of my situation and how I solved it for future reference. Hopefully it can point someone in the right direction and save them some time.
In my case the problem was mainly that I didn't make use of feed_dict to supply the network state when executing sess.run(). Instead I redeclared outputs, final_state and prediction every iteration. The answer at https://github.com/tensorflow/tensorflow/issues/1439#issuecomment-194405649 made me realize how stupid that was... I was constantly creating new graph nodes in every iteration, making it all slower and slower. The problematic code looked something like this:
# defining the network
lstm_layer = rnn.BasicLSTMCell(num_units, forget_bias=1)
outputs, final_state = rnn.static_rnn(lstm_layer, input, initial_state=rnn_state, dtype='float32')
prediction = tf.nn.softmax(tf.matmul(outputs[-1], out_weights)+out_bias)
for input_data in data_seq:
# redeclaring, stupid stupid...
outputs, final_state = rnn.static_rnn(lstm_layer, input, initial_state=rnn_state, dtype='float32')
prediction = tf.nn.softmax(tf.matmul(outputs[-1], out_weights)+out_bias)
p, rnn_state = sess.run((prediction, final_state), feed_dict={x: input_data})
The solution was of course to only declare the nodes once in the beginning, and supply the new data with feed_dict. The code went from being half slow (> 15 ms in the beginning) and becoming slower for every iteration, to execute every iteration in around 1 ms. My new code looks something like this:
out_weights = tf.Variable(tf.random_normal([num_units, n_classes]), name="out_weights")
out_bias = tf.Variable(tf.random_normal([n_classes]), name="out_bias")
# placeholder for the network state
state_placeholder = tf.placeholder(tf.float32, [2, 1, num_units])
rnn_state = tf.nn.rnn_cell.LSTMStateTuple(state_placeholder[0], state_placeholder[1])
x = tf.placeholder('float', [None, 1, n_input])
input = tf.unstack(x, 1, 1)
# defining the network
lstm_layer = rnn.BasicLSTMCell(num_units, forget_bias=1)
outputs, final_state = rnn.static_rnn(lstm_layer, input, initial_state=rnn_state, dtype='float32')
prediction = tf.nn.softmax(tf.matmul(outputs[-1], out_weights)+out_bias)
# actual network state, which we input with feed_dict
_rnn_state = tf.nn.rnn_cell.LSTMStateTuple(np.zeros((1, num_units), dtype='float32'), np.zeros((1, num_units), dtype='float32'))
it = 0
for input_data in data_seq:
encl_input = [[input_data]]
p, _rnn_state = sess.run((prediction, final_state), feed_dict={x: encl_input, rnn_state: _rnn_state})
print("{} - {}".format(it, p))
it += 1
Moving the declaration out from the for loop also got rid of the problem which the OP sdr2002 had, doing a slice outputs[-1] in sess.run() inside the for loop.

As mentioned above, no sliced output for 'sess.run()' is much appreciated for this case.
def action(x, h_c, h_m):
t0 = time.time()
outputs, output_h = sess.run([rnn_states, rnn_hidden_cm], feed_dict={
rnn_input:x,
rnn_init_hidden_c: h_c,
rnn_init_hidden_m: h_m
})
outputs = outputs[:,-1:,:]
dt = time.time() - t0
return outputs, output_h, dt

Using TensorArrays in the context of a while_loop to accumulate values

Below I have an implementation of a Tensorflow RNN Cell, designed to emulate Alex Graves' algorithm ACT in this paper: http://arxiv.org/abs/1603.08983.
At a single timestep in the sequence called via rnn.rnn(with a static sequence_length parameter, so the rnn is unrolled dynamically - I am using a fixed batch size of 20), we recursively call ACTStep, producing outputs of size(1,200) where the hidden dimension of the RNN cell is 200 and we have a batch size of 1.
Using the while loop in Tensorflow, we iterate until the accumulated halting probability is high enough. All of this works reasonably smoothly, but I am having problems accumulating states, probabilities and outputs within the while loop, which we need to do in order to create weighted combinations of these as the final cell output/state.
I have tried using a simple list, as below, but this fails when the graph is compiled as the outputs are not in the same frame(is it possible to use the "switch" function in control_flow_ops to forward the tensors to the point at which they are required, ie the add_n function just before we return the values?). I have also tried using the TensorArray structure, but I am finding this difficult to use as it seems to destroy shape information and replacing it manually hasn't worked. I also haven't been able to find much documentation on TensorArrays, presumably as they are, I imagine, mainly for internal TF use.
Any advice on how it might be possible to correctly accumulate the variables produced by ACTStep would be much appreciated.
class ACTCell(RNNCell):
"""An RNN cell implementing Graves' Adaptive Computation time algorithm"""
def __init__(self, num_units, cell, epsilon, max_computation):
self.one_minus_eps = tf.constant(1.0 - epsilon)
self._num_units = num_units
self.cell = cell
self.N = tf.constant(max_computation)
#property
def input_size(self):
return self._num_units
#property
def output_size(self):
return self._num_units
#property
def state_size(self):
return self._num_units
def __call__(self, inputs, state, scope=None):
with vs.variable_scope(scope or type(self).__name__):
# define within cell constants/ counters used to control while loop
prob = tf.get_variable("prob", [], tf.float32,tf.constant_initializer(0.0))
counter = tf.get_variable("counter", [],tf.float32,tf.constant_initializer(0.0))
tf.assign(prob,0.0)
tf.assign(counter, 0.0)
# the predicate for stopping the while loop. Tensorflow demands that we have
# all of the variables used in the while loop in the predicate.
pred = lambda prob,counter,state,input,\
acc_state,acc_output,acc_probs:\
tf.logical_and(tf.less(prob,self.one_minus_eps), tf.less(counter,self.N))
acc_probs = []
acc_outputs = []
acc_states = []
_,iterations,_,_,acc_states,acc_output,acc_probs = \
control_flow_ops.while_loop(pred,
self.ACTStep,
[prob,counter,state,input,acc_states,acc_outputs,acc_probs])
# TODO:fix last part of this, need to use the remainder.
# TODO: find a way to accumulate the regulariser
# here we take a weighted combination of the states and outputs
# to use as the actual output and state which is passed to the next timestep.
next_state = tf.add_n([tf.mul(x,y) for x,y in zip(acc_probs,acc_states)])
output = tf.add_n([tf.mul(x,y) for x,y in zip(acc_probs,acc_outputs)])
return output, next_state
def ACTStep(self,prob,counter,state,input, acc_states,acc_outputs,acc_probs):
output, new_state = rnn.rnn(self.cell, [input], state, scope=type(self.cell).__name__)
prob_w = tf.get_variable("prob_w", [self.cell.input_size,1])
prob_b = tf.get_variable("prob_b", [1])
p = tf.nn.sigmoid(tf.matmul(prob_w,new_state) + prob_b)
acc_states.append(new_state)
acc_outputs.append(output)
acc_probs.append(p)
return [tf.add(prob,p),tf.add(counter,1.0),new_state, input,acc_states,acc_outputs,acc_probs]

I'm going to preface this response that this is NOT a complete solution, but rather some commentary on how to improve your cell.
To start off, in your ACTStep function, you call rnn.rnn for one timestep (as defined by [input]. If you're doing a single timestep, it is probably more efficient to simple use the actual self.cell call function. You'll see this same mechanism used in tensorflow rnncell wrappers
You mentioned that you have tried using TensorArrays. Did you pack and unpack the tensorarrays appropriately? Here is a repo where you'll find under model.py the tensorarrays are packed and unpacked properly.
You also asked if there is a function in control_flow_ops that will require all the tensors to be accumulated. I think you are looking for tf.control_dependencies
You can list all of your output tensors operations in control_dependicies and that will require tensorflow to compute all tensors up into that point.
Also, it looks like your counter variable is trainable. Are you sure you want this to be the case? If you're adding plus one to your counter, that probably wouldn't yield the correct result. On the other hand, you could have purposely kept it trainable to differentiate it at the end for the ponder cost function.
Also I believe the Remainder function should be in your script:
remainder = 1.0 - tf.add_n(acc_probs[:-1])
#note that there is a -1 in the list as you do not want to grab the last probability
Here is my version of your code edited:
class ACTCell(RNNCell):
"""An RNN cell implementing Graves' Adaptive Computation time algorithm
Notes: https://www.evernote.com/shard/s189/sh/fd165646-b630-48b7-844c-86ad2f07fcda/c9ab960af967ef847097f21d94b0bff7
"""
def __init__(self, num_units, cell, max_computation = 5.0, epsilon = 0.01):
self.one_minus_eps = tf.constant(1.0 - epsilon) #episolon is 0.01 as found in the paper
self._num_units = num_units
self.cell = cell
self.N = tf.constant(max_computation)
#property
def input_size(self):
return self._num_units
#property
def output_size(self):
return self._num_units
#property
def state_size(self):
return self._num_units
def __call__(self, inputs, state, scope=None):
with vs.variable_scope(scope or type(self).__name__):
# define within cell constants/ counters used to control while loop
prob = tf.constant(0.0, shape = [batch_size])
counter = tf.constant(0.0, shape = [batch_size])
# the predicate for stopping the while loop. Tensorflow demands that we have
# all of the variables used in the while loop in the predicate.
pred = lambda prob,counter,state,input,acc_states,acc_output,acc_probs:\
tf.logical_and(tf.less(prob,self.one_minus_eps), tf.less(counter,self.N))
acc_probs, acc_outputs, acc_states = [], [], []
_,iterations,_,_,acc_states,acc_output,acc_probs = \
control_flow_ops.while_loop(
pred,
self.ACTStep, #looks like he purposely makes the while loop here
[prob, counter, state, input, acc_states, acc_outputs, acc_probs])
'''mean-field updates for states and outputs'''
next_state = tf.add_n([x*y for x,y in zip(acc_probs,acc_states)])
output = tf.add_n([x*y for x,y in zip(acc_probs,acc_outputs)])
remainder = 1.0 - tf.add_n(acc_probs[:-1]) #you take the last off to avoid a negative ponder cost #the problem here is we need to take the sum of all the remainders
tf.add_to_collection("ACT_remainder", remainder) #if this doesnt work then you can do self.list based upon timesteps
tf.add_to_collection("ACT_iterations", iterations)
return output, next_state
def ACTStep(self,prob, counter, state, input, acc_states, acc_outputs, acc_probs):
'''run rnn once'''
output, new_state = rnn.rnn(self.cell, [input], state, scope=type(self.cell).__name__)
prob_w = tf.get_variable("prob_w", [self.cell.input_size,1])
prob_b = tf.get_variable("prob_b", [1])
halting_probability = tf.nn.sigmoid(tf.matmul(prob_w,new_state) + prob_b)
acc_states.append(new_state)
acc_outputs.append(output)
acc_probs.append(halting_probability)
return [p + prob, counter + 1.0, new_state, input,acc_states,acc_outputs,acc_probs]
def PonderCostFunction(self, time_penalty = 0.01):
'''
note: ponder is completely different than probability and ponder = roe
the ponder cost function prohibits the rnn from cycling endlessly on each timestep when not much is needed
'''
n_iterations = tf.get_collection_ref("ACT_iterations")
remainder = tf.get_collection_ref("ACT_remainder")
return tf.reduce_sum(n_iterations + remainder) #completely different from probability
This is a complicated paper to implement that I have been working on myself. I wouldn't mind collaborating with you to get it done in Tensorflow. If you're interested, please add me at LeavesBreathe on Skype and we can go from there.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Tensorflow: OOM when batch size too large - tensorflow

Related

How to minimise a multivariate cost function in Julia with Optim?

tensoflow sparse tensor efficient batching

consistent forward / backward pass with tensorflow dropout

RNN Slow-down phenomenon of Tensorflow

Using TensorArrays in the context of a while_loop to accumulate values

Categories

Resources