Blocking of tf.contrib.StagingArea get() and put() operations - tensorflow

Work environment
TensorFlow release version : 1.3.0-rc2
TensorFlow git version : v1.3.0-rc1-994-gb93fd37
Operating System : CentOS Linux release 7.2.1511 (Core)
Problem Scenario
I am using TensorFlow StagingArea ops to improve the efficiency of my input pipeline. Here is the part of my code that constructs the pipeline:
train_put_op_list = []
train_get_op_list = []
val_put_op_list = []
val_get_op_list = []
with tf.variable_scope(tf.get_variable_scope()) as vscope:
    for i in range(4):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('GPU-Tower-%d' % i) as scope:
                trainstagingarea = tf.contrib.staging.StagingArea(dtypes=[tf.float32, tf.int32],
                                                                  shapes=[[64, 221, 221, 3], [64]],
                                                                  capacity=0)
                valstagingarea = tf.contrib.staging.StagingArea(dtypes=[tf.float32, tf.int32],
                                                                shapes=[[128, 221, 221, 3], [128]],
                                                                capacity=0)
                train_put_op_list.append(trainstagingarea.put(train_iterator.get_next()))
                val_put_op_list.append(valstagingarea.put(val_iterator.get_next()))
                train_get_op_list.append(trainstagingarea.get())
                val_get_op_list.append(valstagingarea.get())
                with tf.device('/cpu:0'):
                    worktype = tf.get_variable("wt", [], initializer=tf.zeros_initializer(), trainable=False)
                    workcondition = tf.equal(worktype, 1)
                    # elem = tf.cond(workcondition, lambda: train_iterator.get_next(), lambda: val_iterator.get_next())
                    elem = tf.cond(workcondition, lambda: train_get_op_list[i], lambda: val_get_op_list[i])
                # This is followed by the network construction and optimizer
At execution time, I first run the put() ops a few times and then start the iterations, as shown below:
with tf.Session(config=config) as sess:
    sess.run(init_op)
    sess.run(iterator_training_op)
    sess.run(iterator_validation_op)
    sess.run(tf.assign(worktype, 0))
    for i in range(4):
        sess.run(train_put_op_list)
        sess.run(val_put_op_list)
    writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
    epoch = 0
    iter = 0
    previous = 0
    while epoch < 10:
        try:
            if PROCESSINGTYPE == 'validation':
                sess.run(val_put_op_list)
                [val_accu, summaries, numsamp] = sess.run([running_accuracy, validation_summary_op, processed])
                previous += numsamp
                print("Running Accuracy = {} : Number of sample processed = {} ".format(val_accu, previous))
            else:
                sess.run(train_put_op_list)
                [loss_value, _, train_accu, summaries, batch_accu, numsamp] = sess.run([total_loss, apply_gradient_op, running_accuracy, training_summary_op, batch_accuracy, processed])
                # Remaining part of the code (not important for the question)
Problem Description
The use of StagingArea improves the speed substantially (almost 3-4 times). However, the code hangs on some blocking operation, and I am not sure whether the block comes from get() or put(). Here is the actual output:
# Validation is done first and the following is the output
Running Accuracy = 0.0 : Number of sample processed = 512
Running Accuracy = 0.00390625 : Number of sample processed = 1024
Running Accuracy = 0.0 : Number of sample processed = 1536
Running Accuracy = 0.001953125 : Number of sample processed = 2048
# The code hangs here
Notice that at the beginning of the with tf.Session(config=config) as sess: block, the put() ops were run 4 times, and the output above is limited to 4 lines as well. This means that sess.run(val_put_op_list) inside the while loop does not seem to do anything, so when get() is triggered by sess.run([running_accuracy, ...]), the StagingArea is found empty after 4 iterations and the call blocks.
Am I correct in my analysis of the problem?
What is the correct way to use the get() and put() ops here?
If the StagingArea is full and put() blocks, would that also block the whole program? The TensorFlow documentation does not say anything about this.

Take a look at https://github.com/tensorflow/tensorflow/pull/13684. It resolves some deadlocks and will likely go into 1.4.0. Disclaimer: I am not a TensorFlower.
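For what it's worth, the usual StagingArea pattern is to prefill the buffer once and then pair every get() with a put() in the same session.run call, so the area never empties in the middle of the loop. A minimal single-tower sketch of that pattern, with placeholder data instead of the original iterators:
import tensorflow as tf

# Placeholder batch; the real pipeline would use train_iterator.get_next().
images = tf.random_normal([64, 221, 221, 3])
labels = tf.zeros([64], dtype=tf.int32)

area = tf.contrib.staging.StagingArea(dtypes=[tf.float32, tf.int32],
                                      shapes=[[64, 221, 221, 3], [64]])
put_op = area.put([images, labels])
batch = area.get()                    # [images, labels] taken from the buffer
train_op = tf.reduce_mean(batch[0])   # stand-in for the real training step

with tf.Session() as sess:
    sess.run(put_op)                  # prefill once before the loop
    for step in range(10):
        # Consume one staged element and refill one in the same run call,
        # so get() always finds data and never blocks.
        sess.run([train_op, put_op])
    sess.run(area.get())              # drain the last staged element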

Related

python multiprocessing pool.map hangs when calling tensorflow/keras model

I use pool.map from multiprocessing to parallelize my Python code. When I call my tensorflow/keras model with pool.map, the code hangs if my neural network is larger than a certain size. I still have plenty of RAM available, and calling the model outside of the pool works fine.
I use Python 3.7 and TensorFlow 2.3 on Linux.
An MWE is provided below (it is also on Colab):
import os
import multiprocessing
import numpy as np
import tensorflow as tf

def my_function(i):
    a = MODEL(np.array(i).reshape(1, 1))
    print('foo', i)
    return a

THREADS = os.cpu_count()
N = 4
NEURONS = 150000  # works for 100000, hangs for 150000

MODEL = tf.keras.Sequential([tf.keras.layers.Dense(NEURONS, input_shape=(1,))])
my_function(10)  # works fine

pool = multiprocessing.Pool(THREADS)
_ = pool.map(my_function, range(N))  # hangs
pool.close()
pool.join()
Any idea what the issue is? How can I call a large model in parallel?
Edit: the size of a is not the issue, and the code hangs only if tf.keras is called once outside the pool; see the MWE below and the Colab. The critical number of neurons is lower than in the original example. Any idea?
def my_function(i):
    print('start', i)
    model = tf.keras.Sequential([tf.keras.layers.Dense(NEURONS, input_shape=(1,))])
    print('finish', i)
    return None

THREADS = os.cpu_count()
N = 4
NEURONS = 20000  # works with 10000, not with 20000

# works
pool = multiprocessing.Pool(THREADS)
_ = pool.map(my_function, range(N))
pool.close()
pool.join()

# works
my_function(10)

# doesn't work if many neurons
pool = multiprocessing.Pool(THREADS)
_ = pool.map(my_function, range(N))
pool.close()
pool.join()
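For reference (this is not from the original thread): forking a process that has already initialized TensorFlow is a known source of hangs, because the child inherits the parent's TensorFlow runtime state. A common workaround is the 'spawn' start method, so each worker starts from a fresh interpreter and builds its own model. A sketch under that assumption, reusing the names from the MWE:
import multiprocessing as mp
import numpy as np
import tensorflow as tf

NEURONS = 20000

def my_function(i):
    # Build the model inside the worker so each process owns its own TF state.
    model = tf.keras.Sequential([tf.keras.layers.Dense(NEURONS, input_shape=(1,))])
    return model(np.array(i, dtype=np.float32).reshape(1, 1)).numpy()

if __name__ == '__main__':
    # 'spawn' starts each worker from a fresh interpreter instead of forking
    # the parent that already imported and used TensorFlow.
    ctx = mp.get_context('spawn')
    with ctx.Pool(4) as pool:
        print(pool.map(my_function, range(4)))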

Tensorflow: OOM when batch size too large

My script is failing due to excessive memory usage. When I reduce the batch size, it works.
@tf.function(autograph=not DEBUG)
def step(prev_state, input_b):
    input_b = tf.reshape(input_b, shape=[1, input_b.shape[0]])
    state = FastALIFStateTuple(v=prev_state[0], z=prev_state[1], b=prev_state[2], r=prev_state[3])

    new_b = self.decay_b * state.b + (tf.ones(shape=[self.units], dtype=tf.float32) - self.decay_b) * state.z
    thr = self.thr + new_b * self.beta

    z = state.z
    i_in = tf.matmul(input_b, W_in)
    i_rec = tf.matmul(z, W_rec)
    i_t = i_in + i_rec
    I_reset = z * thr * self.dt
    new_v = self._decay * state.v + (1 - self._decay) * i_t - I_reset

    # Spike generation
    is_refractory = tf.greater(state.r, .1)
    zeros_like_spikes = tf.zeros_like(z)
    new_z = tf.where(is_refractory, zeros_like_spikes, self.compute_z(new_v, thr))
    new_r = tf.clip_by_value(state.r + self.n_refractory * new_z - 1,
                             0., float(self.n_refractory))
    return [new_v, new_z, new_b, new_r]

@tf.function(autograph=not DEBUG)
def evolve_single(inputs):
    accumulated_state = tf.scan(step, inputs, initializer=state0)
    Z = tf.squeeze(accumulated_state[1])  # -> [T, units]
    if self.model_settings['avg_spikes']:
        Z = tf.reshape(tf.reduce_mean(Z, axis=0), shape=(1, -1))
    out = tf.matmul(Z, W_out) + b_out
    return out  # -> [BS, Num_labels]

# # - Using a simple loop
# out_store = []
# for i in range(fingerprint_3d.shape[0]):
#     out_store.append(tf.squeeze(evolve_single(fingerprint_3d[i, :, :])))
# return tf.reshape(out_store, shape=[fingerprint_3d.shape[0], self.d_out])

final_out = tf.squeeze(tf.map_fn(evolve_single, fingerprint_3d))  # -> [BS, T, self.units]
return final_out
This code snippet is inside a tf.function, but I omitted it since I don't think it's relevant.
As can be seen, I run the code on fingerprint_3d, a tensor that has the dimension [BatchSize,Time,InputDimension], e.g. [50,100,20]. When I run this with BatchSize < 10 everything works fine, although tf.scan already uses a lot of memory for that.
When I now execute the code on a batch of size 50, I suddenly get an OOM, even though I am executing it in an iterative manner (the commented-out loop above).
How should I execute this code so that the Batch Size doesn't matter?
Is TensorFlow maybe parallelizing my for loop so that it executes over multiple batches at once?
Another, unrelated question: which function should I use instead of tf.scan if I only want to accumulate one state variable, as opposed to tf.scan, which accumulates all the state variables? Or is that possible with tf.scan?
As mentioned in the discussions here, tf.foldl, tf.foldr, and tf.scan all require keeping track of all values for all iterations, which is necessary for computations like gradients. I am not aware of any ways to mitigate this issue; still, I would also be interested if anyone has a better answer than mine.
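As for the side question about accumulating only one state variable: one option (a sketch, not from the original answer) is a tf.while_loop that carries all state variables but writes only the one you care about into a tf.TensorArray. The shapes and dynamics below are made up:
import tensorflow as tf

@tf.function
def evolve(inputs, v0, z0):
    # inputs: [T, D]; carry two state variables (v, z) but accumulate only z.
    T = tf.shape(inputs)[0]
    z_acc = tf.TensorArray(tf.float32, size=T)

    def cond(t, v, z, acc):
        return t < T

    def body(t, v, z, acc):
        v = 0.9 * v + tf.reduce_sum(inputs[t])   # placeholder dynamics
        z = tf.cast(v > 1.0, tf.float32)
        return t + 1, v, z, acc.write(t, z)

    _, _, _, z_acc = tf.while_loop(cond, body, [0, v0, z0, z_acc])
    return z_acc.stack()                         # -> [T]
Calling evolve(tf.random.normal([100, 20]), tf.constant(0.0), tf.constant(0.0)) returns only the [T]-shaped trace of z, while v is carried through the loop without being stored. Gradients through such a loop may still require keeping intermediate values alive, as noted above.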
When I used
@tf.function
def get_loss_and_gradients():
    with tf.GradientTape(persistent=False) as tape:
        logits, spikes = rnn.call(fingerprint_input=graz_dict["train_input"], W_in=W_in, W_rec=W_rec, W_out=W_out, b_out=b_out)
        loss = loss_normal(tf.cast(graz_dict["train_groundtruth"], dtype=tf.int32), logits)
    gradients = tape.gradient(loss, [W_in, W_rec, W_out, b_out])
    return loss, logits, spikes, gradients
it works.
When I remove the @tf.function decorator, the memory blows up. So it really seems important that TensorFlow can create a graph for your computations.

TPU slower than GPU?

I just tried using a TPU in Google Colab and wanted to see how much faster the TPU is than a GPU. Surprisingly, I got the opposite result.
The following is the NN.
random_image = tf.random_normal((100, 100, 100, 3))
result = tf.layers.conv2d(random_image, 32, 7)
result = tf.reduce_sum(result)
Performance results:
CPU: 8s
GPU: 0.18s
TPU: 0.50s
I wonder why... The complete code for the TPU is as follows:
def calc():
    random_image = tf.random_normal((100, 100, 100, 3))
    result = tf.layers.conv2d(random_image, 32, 7)
    result = tf.reduce_sum(result)
    return result

tpu_ops = tf.contrib.tpu.batch_parallel(calc, [], num_shards=8)
session = tf.Session(tpu_address)
try:
    print('Initializing global variables...')
    session.run(tf.global_variables_initializer())
    print('Warming up...')
    session.run(tf.contrib.tpu.initialize_system())
    print('Profiling')
    start = time.time()
    session.run(tpu_ops)
    end = time.time()
    elapsed = end - start
    print(elapsed)
finally:
    session.run(tf.contrib.tpu.shutdown_system())
    session.close()
Benchmarking devices properly is hard, so please take everything you learn from these examples with a grain of salt. It's better in general to compare specific models you are interested in (e.g. running an ImageNet network) to understand performance differences. That said, I understand it's fun to do this, so...
Larger models will illustrate the TPU and GPU performance better. Your example also includes the compilation time in the cost of the TPU call: every call after the first for a given program and shape is cached, so you will want to run tpu_ops once before starting the timer unless you want to capture the compilation time.
Currently, each call to a TPU function copies the weights to the TPU before it can start running, which affects small operations more significantly. Here's an example that runs a loop on the TPU before returning to the CPU, with the following outputs:
1 0.010800600051879883
10 0.09931182861328125
100 0.5581905841827393
500 2.7688047885894775
So you can actually run 100 iterations of this function in 0.55s.
import os
import time
import tensorflow as tf

def calc(n):
    img = tf.random_normal((128, 100, 100, 3))

    def body(_):
        result = tf.layers.conv2d(img, 32, 7)
        result = tf.reduce_sum(result)
        return result

    return tf.contrib.tpu.repeat(n[0], body, [0.0])

session = tf.Session('grpc://' + os.environ['COLAB_TPU_ADDR'])
try:
    print('Initializing TPU...')
    session.run(tf.contrib.tpu.initialize_system())

    for i in [1, 10, 100, 500]:
        tpu_ops = tf.contrib.tpu.batch_parallel(calc, [[i] * 8], num_shards=8)
        print('Warming up...')
        session.run(tf.global_variables_initializer())
        session.run(tpu_ops)
        print('Profiling')
        start = time.time()
        session.run(tpu_ops)
        end = time.time()
        elapsed = end - start
        print(i, elapsed)
finally:
    session.run(tf.contrib.tpu.shutdown_system())
    session.close()

In what order does TensorFlow evaluate nodes in a computation graph?

I am having a strange bug in TensorFlow. Consider the following code, part of a simple feed-forward neural network:
output = (tf.matmul(layer_3, w_out) + b_out)
prob = tf.nn.sigmoid(output)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=output, targets=y_, name=None))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss, var_list=model_variables)
(Notice that prob is not used to define the loss function. This is because sigmoid_cross_entropy_with_logits applies the sigmoid internally.)
I later run the optimizer in the following line:
result,step_loss,_ = sess.run(fetches = [output,loss,optimizer],feed_dict = {x_ : np.array([[x,y,x*x,y*y,x*y]]), y_ : [[1,0]]});
The above works just fine. However, if I instead run the following line to run the code, the network seems to perform terribly, even though there shouldn't be any difference!
result,step_loss,_ = sess.run(fetches = [prob,loss,optimizer],feed_dict = {x_ : np.array([[x,y,x*x,y*y,x*y]]), y_ : [[1,0]]});
I have a feeling it has something to do with the order in which TF computes the nodes in the graph during a session, but I'm not sure. What could the issue be?
It's not an issue with the graph; you are just looking at different things.
In the first example you provide:
result,step_loss,_ = sess.run(fetches = [output,loss,optimizer],feed_dict = {x_ : np.array([[x,y,x*x,y*y,x*y]]), y_ : [[1,0]]})
you are saving the result of running the output op in the result python variable.
In the second one:
result,step_loss,_ = sess.run(fetches = [prob,loss,optimizer],feed_dict = {x_ : np.array([[x,y,x*x,y*y,x*y]]), y_ : [[1,0]]})
you are saving the result of the prob op in the result python variable.
Since the two ops are different, it is to be expected that they return different values.
You could run
logits, activation, step_loss, _ = sess.run(fetches = [output, prob, loss, optimizer], ...)
to check your results.
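As a quick sanity check (not part of the original answer), you can fetch both tensors in the same run call and confirm that prob is just the element-wise sigmoid of output, using the same feed as above:
import numpy as np

# Fetch both in one run so they come from the same forward pass.
logits_val, prob_val = sess.run([output, prob],
                                feed_dict={x_: np.array([[x, y, x*x, y*y, x*y]]), y_: [[1, 0]]})
np.testing.assert_allclose(prob_val, 1.0 / (1.0 + np.exp(-logits_val)), rtol=1e-5)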

TensorFlow : Enqueuing and dequeuing a queue from multiple threads

The problem I am trying to solve is as follows :
I have a list trainimgs of filenames. I have defined a
tf.RandomShuffleQueue with its capacity=len(trainimgs) and min_after_dequeue=0.
This tf.RandomShuffleQueue is expected to be filled from trainimgs a specified epochlimit number of times.
A number of threads are expected to work in parallel. Each thread dequeues an element from the tf.RandomShuffleQueue and does some operations on it and enqueues it to another queue. I have got that part right.
However, once one epoch of trainimgs has been processed and the tf.RandomShuffleQueue is empty, provided that the current epoch e < epochlimit, the queue must be filled up again and the threads must resume work.
The good news is : I have got it working in a certain case (See PS at the end !!)
The bad news is : I think that there is a better way of doing this.
The method I am using now is as follows (I have simplified the functions and removed the image-processing-based preprocessing and subsequent enqueuing, but the heart of the processing remains the same):
with tf.Session() as sess:
    train_filename_queue = tf.RandomShuffleQueue(capacity=len(trainimgs), min_after_dequeue=0, dtypes=tf.string, seed=0)
    queue_size = train_filename_queue.size()
    trainimgtensor = tf.constant(trainimgs)
    close_queue = train_filename_queue.close()
    epoch = tf.Variable(initial_value=1, trainable=False, dtype=tf.int32)
    incrementepoch = tf.assign(epoch, epoch + 1, use_locking=True)
    supplyimages = train_filename_queue.enqueue_many(trainimgtensor)
    value = train_filename_queue.dequeue()

    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
    sess.run(init_op)
    coord = tf.train.Coordinator()
    tf.train.start_queue_runners(sess, coord)
    sess.run(supplyimages)

    lock = threading.Lock()
    threads = [threading.Thread(target=work, args=(coord, value, sess, epoch, incrementepoch, supplyimages, queue_size, lock, close_queue)) for i in range(200)]
    for t in threads:
        t.start()
    coord.join(threads)
The work function is as follows :
def work(coord, val, sess, epoch, incrementepoch, supplyimg, q, lock, close_op):
    while not coord.should_stop():
        if sess.run(q) > 0:
            filename, currepoch = sess.run([val, epoch])
            filename = filename.decode(encoding='UTF-8')
            print(filename + ' ' + str(currepoch))
        elif sess.run(epoch) < 2:
            lock.acquire()
            try:
                if sess.run(q) == 0:
                    print("The previous epoch = %d" % (sess.run(epoch)))
                    sess.run([incrementepoch, supplyimg])
                    sz = sess.run(q)
                    print("The new epoch = %d" % (sess.run(epoch)))
                    print("The new queue size = %d" % (sz))
            finally:
                lock.release()
        else:
            try:
                sess.run(close_op)
            except tf.errors.CancelledError:
                print('Queue already closed.')
            coord.request_stop()
    return None
So, although this works, I have a feeling that there is a better and cleaner way to achieve this. In a nutshell, my questions are:
Is there a simpler and cleaner way of achieving this task in TensorFlow ?
Is there any problem with this code's logic ? I am not very experienced with multithreading scenarios, so any obvious faults which have skipped my attention would be very helpful to me.
P.S.: It seems that this code is not perfect after all. When I ran it with 1.2 million images and 200 threads, it worked. However, when I run it with 10 images and 20 threads, it gives the following error:
CancelledError (see above for traceback): RandomShuffleQueue '_0_random_shuffle_queue' is closed.
[[Node: random_shuffle_queue_EnqueueMany = QueueEnqueueManyV2[Tcomponents=[DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](random_shuffle_queue, Const)]]
I thought I had that covered by except tf.errors.CancelledError. What the hell is going on here?
I finally found out the answer. The problem was that multiple threads were clashing at various points in the work() function.
The following work() function works perfectly.
def work(coord, val, sess, epoch, maxepochs, incrementepoch, supplyimg, q, lock, close_op):
    print('I am thread number %s' % (threading.current_thread().name))
    print('I can see a queue with size %d' % (sess.run(q)))
    while not coord.should_stop():
        lock.acquire()
        if sess.run(q) > 0:
            filename, currepoch = sess.run([val, epoch])
            filename = filename.decode(encoding='UTF-8')
            tid = threading.current_thread().name
            print(filename + ' ' + str(currepoch) + ' thread ' + str(tid))
        elif sess.run(epoch) < maxepochs:
            print('Thread %s has acquired the lock' % (threading.current_thread().name))
            print("The previous epoch = %d" % (sess.run(epoch)))
            sess.run([incrementepoch, supplyimg])
            sz = sess.run(q)
            print("The new epoch = %d" % (sess.run(epoch)))
            print("The new queue size = %d" % (sz))
        else:
            coord.request_stop()
        lock.release()
    return None
I recommend having a single thread call enqueue_many epochs times, so that it enqueues the correct total number of images, and then close the queue. This would let you simplify your work function and the other threads.
I think the GIL will prevent any actual parallelism from being done in those threads.
To get performance with TensorFlow you need to keep your data in TensorFlow.
TensorFlow's reading data guide explains how to address a very similar sort of problem.
More specifically, you seem to have rewritten a significant chunk of string_input_producer.
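To make that last point concrete, here is a minimal sketch (with made-up filenames) of how tf.train.string_input_producer handles the shuffle-and-refill-per-epoch logic that the custom work() function reimplements; num_epochs plays the role of epochlimit, and the queue is closed automatically once all epochs are consumed:
import tensorflow as tf

filenames = ['img0.jpg', 'img1.jpg', 'img2.jpg']   # stand-in for trainimgs
filename_queue = tf.train.string_input_producer(
    filenames, num_epochs=10, shuffle=True, seed=0)
filename = filename_queue.dequeue()

with tf.Session() as sess:
    # num_epochs is tracked in a local variable, so local init is required.
    sess.run(tf.group(tf.global_variables_initializer(),
                      tf.local_variables_initializer()))
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            print(sess.run(filename))
    except tf.errors.OutOfRangeError:
        pass  # all epochs have been dequeued
    finally:
        coord.request_stop()
        coord.join(threads)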