I have some fairly large batch sizes on which I'd like to take multiple gradient steps. While I could easily do this with a python for loop, I imagine that there might be a more efficient method that doesn't involve transferring the data to gpu on each iteration. I've tried putting the train op in the fetch list multiple times, but I'm not sure that it's actually being run more than once (the runtime is exactly the same).
If you have variable-sized batch then variable is a bad fit for saving it, and you could instead persist this data between run calls using peristent tensors. Here's a toy example
t = tf.int32
params = tf.Variable(tf.ones_initializer((), dtype=dt))
data_batches = [[1], [2, 3], [4, 5, 6]]
# op that uploads data to TF and saves it as a persistent Tensor
data_saver_placeholder = tf.placeholder(dt)
tensor_handle_op = tf.get_session_handle(data_saver_placeholder)
data_placeholder, data = tf.get_session_tensor(dt)
train_op = tf.assign_add(params, tf.reduce_prod(data))
init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
for batch in data_batches:
# upload tensor to TF runtime and save its handle
tensor_handle = sess.run(tensor_handle_op, feed_dict={data_saver_placeholder: batch})
# run train op several times reusing same data
for i in range(3):
sess.run(train_op, feed_dict={data_placeholder: tensor_handle.handle})
assert sess.run(params) == 382
If you do sess.run([myop,myop]) that'll only run myop once.
If you want to run the op, but not fetch its results to Python runtime you could use a control dependency. A simple way to do this is with a group op, ie
sess.run(tf.group(myop))
sess.run(tf.group(myop))
If your graph is large you may get an extra overhead by constructing group op between run calls (maybe 10-100ms for >10k node graph), so you could construct it ahead of time
myop_nooutput = tf.group(myop)
sess.run(myop_nooutput)
sess.run(myop_nooutput)
Related
If I use multiple elements from a tf.data.Dataset dataset to build the graph, and then evaluate the graph later, it seems the order the element from the Dataset is undefined. As an example, the following code snippet
import tensorflow as tf
dataset = tf.data.Dataset.range(5)
iterator = dataset.make_one_shot_iterator()
print 'build graph and then eval'
keep = []
for i in range(5):
keep.append(iterator.get_next())
with tf.Session() as sess:
keep_eval = sess.run(keep)
print keep_eval
print 'eval each element'
with tf.Session() as sess:
for i in range(5):
print sess.run(iterator.get_next()),
will result in output like:
build graph and then eval
[3 0 1 4 2]
eval each element
0 1 2 3 4
Also, each run will yield different "build graph and then eval".
I would expect "build graph and then eval" to be ordered as well like "eval each element". Can anyone explain why this happens?
The order of a tf.data.Dataset is defined and deterministic (unless you add a non-deterministic Dataset.shuffle()).
However, your two loops build different graphs, which accounts for the difference:
The "build graph and then eval" part creates a list of five iterator.get_next() operations and runs the five operations in parallel. Because these operations run in parallel, they may produce results in different order.
The "eval each element" part also creates five iterator.get_next() operations, but it runs them sequentially, so you always get the results in the expected order.
Note that we do not recommend calling iterator.get_next() in a loop, because it creates a new operation on each call, which gets added to the graph, and consumes memory. Instead, when you loop over a Dataset, try to use the following pattern:
dataset = tf.data.Dataset.range(5)
iterator = dataset.make_one_shot_iterator()
# Call `iterator.get_next()` once and use the result in each iteration.
next_element = iterator.get_next()
with tf.Session() as sess:
for i in range(5):
print sess.run(next_element)
From the TensorFlow FAQs here
The individual ops have parallel implementations, using multiple cores in a CPU, or multiple threads in a GPU.
So your "build graph then eval" call runs in parallel for each element in the list, which is why the numbers are in random order, while the for loop forces one call to be run after another, so its serial. You can verify by timing both, the first one should be fast, the for loop will be slower.
I'm working with tf.data.dataset/iterator mechanism and trying to improve data loading performance. It occurred to me that offloading the entire minibatch loop from Python might help. My data is small enough that storing on CPU or GPU is no problem.
So, Is it possible to loop an optimizer node over a full minibatched epoch within a call to session.run?
The tensor returned by iterator.get_next() is only incremented once per session.run, which would seems to make it impossible to iterate through a dataset of minibatches... but if it could be done, my CPU would only have to touch the Python thread once per epoch.
UPDATE: #muskrat's suggestion to use tf.slice can be used for this purpose. See my subsequent non-answer with a schematic implementation of this using tf.while_loop. However, the question is whether this can be accomplished using dataset/iterators... and I'd still like to know.
From the description it seems that you already have the dataset preloaded as a constant on CPU/GPU, like at this example. That's certainly the first step.
Second, I suggest using tf.slice() to replicate the effect of the minibatch operation. In other words, just manually slice minibatches out of the preloaded constant (your dataset), and you should get the desired behavior. See for example the slice docs or this related post.
If that's not enough detail, please edit your question to include a code example (with mnist or something) and I can give more details.
This "answer" is an implementation of muskrat's tf.slice suggestion with the details of tf.while_loop worked out (with help from How to use tf.while_loop() in tensorflow and https://www.tensorflow.org/api_docs/python/tf/while_loop).
Unless your data and model are small enough that you're bottlenecked by Python I/O (like me!), this solution is probably academic.
Advantages:
Trains over minibatches without returning to the Python thread.
Uses only ops that have GPU implementations meaning that the entire graph can be placed in the GPU.
On my small dataset, which is presumably bottlenecked by Python I/O, this solution is twice the speed of my dataset/iteratior (which touches Python once per minibatch) and four times the speed of passing minibatches through feed_dict.
Disadvantages:
tf.while_loop is treacherous. It's challenging to understand when ops inside the loop's body are evaluated and when those they depend on are evaluated, particularly the (thin) official documentation and limited Stack Overflow coverage.
The missing documentation of tf.while_loop is that tensors outside the body of the loop are only evaluated once, even if inner ops depend on them. This means that optimization, model, and loss have to be defined in the loop. This limits flexibility if you'd like to e.g. be able to call validation loss ops between training epochs. Presumably this could be accomplished with tf.cond statements and the appropriate flags passed in via feed_dict. But not nearly as flexible or elegant as the dataset/iterator mechanism in tf.data.
Adding shuffling operations at each Epoch doesn't seem available on GPU.
Here's my schematic code (I've ommitted the variable and model definition for brevity):
def buildModel(info, training_data, training_targets):
graph = tf.Graph()
with graph.as_default():
# numBatches is passed in from Python once per Epoch.
batch_size = tf.placeholder(tf.float32, name = 'batch_size')
# Initializers for loop variables for tf.while_loop
batchCounter = tf.Variable(0, dtype=tf.float32, trainable=False)
lossList = tf.Variable(tf.zeros([0,1]), trainable=False)
# In a full example, I'd normalize my data here. And possibly shuffle
tf_training_data = tf.constant(training_data, dtype=tf.float32)
tf_training_targets = tf.constant(training_targets, dtype=tf.float32)
# For brevity, I'll spare the definitions of my variables. Because tf.Variables
# are essentially treated as globals in the model and are manipulated directly (like with tf.apply)
# they can reside outside runMinibatch, the body of tf.while_loop.
# weights_1 =
# biases_1 =
# etc.
def moreMinibatches(batchCount, lossList):
return (batchCount + 1) * batch_size <= len(training_data)
def runMinibatch(batchCount, lossList):
# These tensors and ops have to be defined inside runMinibatch, otherwise they're not updated as tf.wile_loop loops. This means
# slices, model definition, loss tensor, and training op.
dat_batch = tf.slice(tf_training_data, [tf.cast(batchCounter * batch_size, tf.int32) , 0], [tf.cast(batch_size, tf.int32), -1])
targ_batch = tf.slice(tf_training_targets, [tf.cast(batchCounter * batch_size, tf.int32) , 0], [tf.cast(batch_size, tf.int32), -1])
# Here's where you'd define the model as a function of weights and biases above and dat_batch
# model = <insert here>
loss = tf.reduce_mean(tf.squared_difference(model, targ_batch))
optimizer = tf.train.AdagradOptimizer() # for example
train_op = optimizer.minimize(while_loss, name='optimizer')
# control_dependences ensures that train_op is run before return
# even though the return values don't explicitly depend on it.
with tf.control_dependencies([train_op]):
return batchCount + 1, tf.concat([lossList, [[while_loss]]],0)
# So, the idea is that this trains a full epoch without returning to Python.
trainMinibatches = tf.while_loop(moreMinibatches, runMinibatch, [minibatchCounter, lossList]
shape_invariants=[batchCounter.get_shape(), tf.TensorShape(None)])
return (graph,
{'trainMinibatches' : trainAllMinibatches,
'minibatchCounter' : minibatchCounter,
'norm_loss' : norm_loss,
} )
numEpochs = 100 # e.g.
minibatchSize = 32 #
# training_dataset = <data here>
# training_targets = <targets here>
graph, ops = buildModel(info, training_dataset, training_targets,
minibatch_size)
with tf.Session(graph=graph, config=config) as session:
tf.global_variables_initializer().run()
for i in range(numEpochs):
# This op will train on as all minibatches that fit in the full dataset. finalBatchCount with be the number of
# complete minibatches in the dataset. lossList is a list of each step's minibatches.
finalBatchCount, lossList = session.run(ops['trainAllMinibatches'],
feed_dict={'batch_size:0':minibatchSize})
print('minibatch losses at Epoch', i, ': ', lossList)
I implemented tf.slice() and tf.while_loop approach to vectorize mini-batch suggested above.
The performance was about 1.86 times faster in my case than the mini-batches using feed_dict, but I found there was a problem that the loss values of each epochs were not stabilized.
Then, I changed to tf.random_shuffle the inputs every epoch, the problem was much mitigated. (the performance gain was reduced to 1.68 times)
In my TensorFlow code I want my network to take inputs from one of the two StagingArea objects depending upon whether I want to do training or testing.
A part of the graph construction code I wrote is as follows :
with tf.device("/gpu:0"):
for i in range(numgpus):
with tf.variable_scope(tf.get_variable_scope(), reuse=i>0) as vscope:
with tf.device('/gpu:{}'.format(i)):
with tf.name_scope('GPU-Tower-{}'.format(i)) as scope:
phase = tf.get_variable("phase", [], initializer=tf.zeros_initializer(),dtype=tf.uint8, trainable=False)
phaseassigntest = phase.assign(1)
phaseassigntrain = phase.assign(0)
phasetest = tf.equal(phase, 0)
is_training = tf.cond(phasetest, lambda: tf.constant(True), lambda: tf.constant(False))
trainstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[trainbatchsize, 3, 221, 221], [trainbatchsize]], capacity=20)
putoptrain = trainstagingarea.put(train_iterator.get_next())
trainputop.append(putoptrain)
getoptrain = trainstagingarea.get()
traingetop.append(getoptrain)
trainclearop = trainstagingarea.clear()
trainstageclear.append(trainclearop)
trainsizeop = trainstagingarea.size()
trainstagesize.append(trainsizeop)
valstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[valbatchsize, 3, 221, 221], [valbatchsize]], capacity=20)
putopval = valstagingarea.put(val_iterator.get_next())
valputop.append(putopval)
getopval = valstagingarea.get()
valgetop.append(getopval)
valclearop = valstagingarea.clear()
valstageclear.append(valclearop)
valsizeop = valstagingarea.size()
valstagesize.append(valsizeop)
#elem = valgetop[i]
elem = tf.cond(is_training,lambda: traingetop[i],lambda: valgetop[i])
img = elem[0]
label = elem[1]
labelonehot = tf.one_hot(label, depth=numclasses)
net, networksummaries = overfeataccurate(img,numclasses=numclasses, phase=is_training)
I have used tf.cond to make sure that the network is fed by one of the two StagingArea objects. One is meant for training and the other one is meant for validation.
Now, when I try to execute the graph as follows, I do not get any result and infact the code just hangs and I have to kill the process.
with tf.Session(graph=g,config=config) as sess:
sess.run(init_op)
sess.run(tf.local_variables_initializer())
sess.run(val_initialize)
for i in range(20):
sess.run(valputop)
print(sess.run(valstagesize))
writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
epoch = 0
iter = 0
print("Performing Validation")
sess.run(phaseassigntest)
saver = tf.train.Saver()
while(epoch<10):
time_init = time.time()
while True:
try:
[val_accu, _, summaries] = sess.run([towervalidation, towervalidationupdateop,validation_summary_op])
print(val_accu)
when instead of tf.cond() I directly assign elem = valgetop[i], the code works just fine.
Am I missing something over here ?
What is the right way to feed my network based on whether I want to do training or testing ?
NOTE The error does not go away even if I set numgpus to 1.
Your problem
What you think tf.cond does
Based on the flag, execute what is required to put either traingetop[i] or valgetop[i] into your elem tensor.
What tf.cond actually does
Executes what is required to get both traingetop[i] and valgetop[i], then passes one of them into your elem tensor.
So
The reason it is hanging forever is because it's waiting for an element to be added to your training staging area (so that it can get that element and discard it). You're forgiven for not realising this is what it's doing; it's actually very counter-intuitive. The documentation is awfully unclear on how to deal with this.
Recommended Solution (by Tensorflow documentation)
If you really need the queues to be in the same graph, then you need to make two copies of your ENTIRE graph, one that is fed by your training staging area, and one that is fed by your validation staging area. Then you just use the relevant tensor in your sess.run call. I recommend creating a function that takes a queue output tensor, and returns a model_output tensor. Now you have a train_time_output tensor and a validation_time_output tensor, and you can choose which one you want to execute in your sess.run.
Warning
You need to make sure that you aren't actually creating new variables to go along with these new ops. To do that take a look at the latest documentation on variables. It looks like they've simplified it from v0.12, and it essentially boils down to using tf.get_variable instead of tf.Variable to create your variables.
My preferred work around
Although that is the recommended solution (AFAIK), it is extremely unsatisfying to me; you're creating a whole other set of operations on the graph that just happen to use the same weights. It seems like there's a lot of potential for programmer error by abusing the separation between train time and test/validation time (resulting in the model acting unexpectedly different at these times). Worse; it doesn't solve the problem of tf.cond demanding the values for inputs to both branches, it just forces you to copy your whole graph, which is not always possible.
I prefer to just not have my queues in the graph like that, and treat the model as a function which can be fed an example without caring where it's from. That is, I would instantiate the model with a tf.placeholder as the input, and at execution time I would use feed_dict to actually provide the value. It would function something like this
#inside main training loop
if time_to_train:
example = sess.run(traingettop)
else:
example = sess.run(valgettop)
result = sess.run(model_output, {input_placeholder: example})
It's very useful to note that you can use the feed_dict to feed any value for any tensor anywhere in your model. So, you can change any model definition that, due to tf.cond would always require an input, like:
a = tf.constant(some_value)
b = tf.placeholder(tf.float32)
flag = tf.placeholder(tf.bool, [])
one_of_them = tf.cond(flag, a, b)
model_output = build_graph(one_of_them)
Into a definition that doesn't, like:
a = tf.constant(some_value)
model_output = build_graph(a)
Remembering that you can always overwrite what a is at execution time:
# In main training loop,
sess.run(train_op, {a: some_other_value})
This essentially pushes the conditional into native python land. In your code you might end up with something like:
if condition_satisfied:
sess.run(train_op, {a:some_other_value})
else:
sess.run(train_op)
Performance concerns
If you are using tensorflow on a single machine, then there is practically no performance cost to this solution, as the numpy array/s put into the example python variable are actually still stored on the GPU.
If you are using tensorflow in a distributed fashion, then this solution would kill your performance; it would require sending the example from whatever machine it's on to the master so that it can send it back.
I need to execute the statement, sess.run() multiple times.
I create the sess once at the beginning of my code. However, each sess.run() statement takes almost 0.5-0.8 seconds on my CPU machine. Is there any way I can optimize this? Since Tensorflow does lazy loading, is there any way I can make it not do it, and make this faster?
I'm using the Inception model from the image classifying example.
def load_network():
with gfile.FastGFile('model.pb', 'rb') as f:
graph_def = tf.GraphDef()
data = f.read()
graph_def.ParseFromString(data)
png_data = tf.placeholder(tf.string, shape=[])
decoded_png = tf.image.decode_png(png_data, channels=3)
_ = tf.import_graph_def(graph_def, name=input_map={'DecodeJpeg': decoded_png})
return png_data
def get_pool3(sess, png_data, imgBuffer):
pool3 = sess.graph.get_tensor_by_name('pool_3:0')
pool3Vector = sess.run(pool3, {png_data: imgBuffer.getvalue()})
return pool3Vector
def main():
sess = getTensorSession()
png_data = load_network()
# The below line needs to be called multiple times, which is what takes
# nearly 0.5-0.8 seconds.
# imgBuffer contains the stored value of the image.
pool3 = get_pool3(sess, png_data, imgBuffer)
Tensorflow runs operations lazily --- nothing is actually computed until sess.run() is called. When you call sess.run(), Tensorflow executes all of the operations in your computation graph. So if sess.run() is taking 0.5-0.8 seconds, it is likely that your computation itself is taking 0.5-0.8s.
(There is some overhead to sess.run(), but it shouldn't be anywhere near the order of half a second.)
Hope that helps!
Added:
Here are some things you might look into to speed your computation up:
use Tensorflow's profiling tools to look at what part of your computation is taking the time. They are not documented yet, but you can find some information about them in this github issue: https://github.com/tensorflow/tensorflow/issues/1824
make your computation cheaper --- reduce the complexity of your model, use smaller images, etc.
run your computation on a GPU instead of CPU.
I have been using Tensorflow with the l-bfgs optimizer from openopt. It was pretty easy to setup callbacks to allow Tensorflow to compute gradients and loss evaluations for the l-bfgs, however, I am having some trouble figuring out how to introduce stochastic elements like dropout into the training procedure.
During the line search, l-bfgs performs multiple evaluations of the loss function, which need to operate on the same network as the prior gradient evaluation. However, it seems that for each evaluation of the tf.nn.dropout function, a new set of dropouts is created. I am looking for a way to fix the dropout over multiple evaluations of the loss function, and then allow it to change between the gradient steps of the l-bfgs. I'm assuming this has something to do with the control flow ops in tensorflow, but there isn't really a good tutorial on how to use these and they are a little enigmatic to me.
Thanks for your help!
Drop-out relies on uses random_uniform which is a stateful op, and I don't see a way to reset it. However, you can hack around it by substituting your own random numbers and feeding them to the same input point as random_uniform, replacing the generated values
Taking the following code:
tf.reset_default_graph()
a = tf.constant([1, 1, 1, 1, 1], dtype=tf.float32)
graph_level_seed = 1
operation_level_seed = 1
tf.set_random_seed(graph_level_seed)
b = tf.nn.dropout(a, 0.5, seed=operation_level_seed)
Visualize the graph to see where random_uniform is connected
You can see dropout takes input of random_uniform through the Add op which has a name mydropout/random_uniform/(random_uniform). Actually the /(random_uniform) suffix is there for UI reasons, and the true name is mydropout/random_uniform as you can see by printing tf.get_default_graph().as_graph_def(). That gives you shortened tensor name. Now you append :0 to get actual tensor name. (side-note: operation could produce multiple tensors which correspond to suffixes :0, :1 etc. Since having one output is the most common case, :0 is implicit in GraphDef and node input is equivalent to node:0. However :0 is not implicit when using feed_dict so you have to explicitly write node:0)
So now you can fix the seed by generating your own random numbers (of the same shape as incoming tensor), and reusing them between invocations.
tf.reset_default_graph()
a = tf.constant([1, 1, 1, 1, 1], dtype=tf.float32)
graph_level_seed = 1
operation_level_seed = 1
tf.set_random_seed(graph_level_seed)
b = tf.nn.dropout(a, 0.5, seed=operation_level_seed, name="mydropout")
random_numbers = np.random.random(a.get_shape()).astype(dtype=np.float32)
sess = tf.Session()
print sess.run(b, feed_dict={"mydropout/random_uniform:0":random_numbers})
print sess.run(b, feed_dict={"mydropout/random_uniform:0":random_numbers})
You should see the same set of numbers with 2 run calls.