I'm playing around with the tf.contrib.data.Dataset API. Basically, I have an input tensor generated by dataset.batch(BATCH_SIZE).get_next() and I can currently do something like
for _ in range(100):
    sess.run(train_op)
to repeat the train step 100 times. I don't need to feed anything, since all the inputs come from dataset. Doing this in a loop seems like a waste, so is there a way to tell TF to repeat the run step 100 times without having to fall back into Python code between every iteration?
I saw some similar questions about preventing the CPU-GPU transfer between iterations by feeding persistent tensors residing on the GPU, but that's a different problem.
Related
I have seen tutorials that call .repeat() while loading, shuffling, mapping, batching, and prefetching a TensorFlow dataset, while others skip it completely.
I know what repeat does and how it is used, but I am not able to figure out when it should be used and when it should not.
Any help?
It depends. Let's use MNIST as an example. Say we build a dataset using from_tensor_slices. The training dataset has 60000 samples.
Let's say we use batch size 100 and do not use repeat. This means the dataset will provide 600 batches. Now, if we try to train a model, for example using the keras fit interface, the dataset will simply run out of samples after 600 steps! We will not be able to train more than that. Using repeat, the dataset will instead simply "start fresh" once it runs out, and we can train as long as we like.
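Here is a minimal sketch of the two cases (written in TF 2.x Keras style; the tiny model and the numbers are only illustrative):

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0

# 60000 samples with batch size 100 -> 600 batches per pass over the data
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(100)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Without .repeat(), fit() runs out of data after 600 steps.
# With .repeat(), the dataset starts over, so each epoch is bounded explicitly.
model.fit(dataset.repeat(), epochs=5, steps_per_epoch=600)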
Other tutorials might use a manual training loop. Perhaps you have a loop like
for batch in data_set:
    ...
In this example, once again, the loop will simply stop after 600 batches if we do not use repeat. However, we can do this:
for epoch in range(n_epochs):
    for batch in data_set:
        ...
In this example, we specify the number of passes over the dataset in n_epochs. The inner loop stops after 600 batches, but then the outer loop (epoch) simply increments by 1, and the inner loop starts again. This way, we can have more than 600 batches even without using repeat.
Finally, there are of course other ways to create datasets. For example, from_generator can be used to stream a dataset from a Python generator that can run infinitely long, so repeat is not necessary at all.
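Here is a minimal sketch of such a generator-backed dataset (the shapes and random data are made up purely for illustration):

import numpy as np
import tensorflow as tf

def infinite_generator():
    # Yields (feature, label) pairs forever, so the dataset never runs out.
    while True:
        x = np.random.rand(28, 28).astype("float32")
        y = np.random.randint(0, 10)
        yield x, y

dataset = tf.data.Dataset.from_generator(
    infinite_generator,
    output_types=(tf.float32, tf.int32),
    output_shapes=((28, 28), ()))

# No .repeat() needed: the generator itself supplies data indefinitely.
dataset = dataset.batch(100)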
Without having seen the tutorials you are referring to, I can only guess that the differences regarding use of repeat can be explained by differences in how the training loop is coded, such as the above.
I am fairly new to LSTMs, but I have already searched for a solution and could not find anything satisfying or even similar enough.
So here is my problem:
I am dealing with sleep classification and have annotated records for about 6k patients.
To train my bidirectional LSTM, I pick one patient at a time and fit the model on that data, instead of putting all the data from all patients into one big matrix, because I want to prevent samples from different patients from being mixed when Keras trains with mini-batches.
The sequence length, or sample_size, differs from patient to patient.
Then I loop over all patients, with an additional loop for the number of epochs I want to train the model for (as described in the Developer Guides).
Since LSTMs (if not stateful) reset their cell and hidden state after each batch, and the default batch_size for tf.keras.Sequential.fit() is 32, I wanted batch_size to match the sample_size of the patient I am showing to the network. If I do so, I get a warning and the training process errors out after some time. The warning is:
WARNING:tensorflow:6 out of the last 11 calls to .distributed_function at 0x0000023F9D517708> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings is likely due to passing python objects instead of tensors. Also, tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. Please refer to https://www.tensorflow.org/beta/tutorials/eager/tf_function#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
So I looked up what my longest sample_size is and set my batch_size accordingly.
tl;dr: What is Keras doing in all the instances where my variable sample_size does not match my batch_size=max(len(sample_size))?
Is it just showing the available samples to the network?
If so: Why do I get the warning mentioned above, where setting batch_size=sample_size leads to failed training?
Or is it showing the available samples to the network and padding the rest with zeros to match the given batch_size?
If so: Why is masking necessary when using e.g. stateful mode?
edit:
So, I tried some additional workarounds and built my own data generator, which provides the data of one patient as one batch. I then set steps_per_epoch=len(train_patients) to include all patients in one epoch. I get no warnings about retracing, which I do not understand either.
It seems to solve my problem of showing one patient per batch without mixing patient data, while allowing a variable sample_size, but I really do not understand the differences between all these possibilities and their different warnings.
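One way to build such a per-patient generator is with tf.keras.utils.Sequence; this is a minimal sketch under that assumption (the names and shapes are hypothetical, and the original may well have used a plain Python generator instead):

import tensorflow as tf

class PatientBatchGenerator(tf.keras.utils.Sequence):
    """One patient's samples form one batch; the batch size varies per patient."""

    def __init__(self, patient_features, patient_labels):
        # patient_features[i]: array of shape (sample_size_i, timesteps, n_features)
        # patient_labels[i]:   array of shape (sample_size_i, n_classes)
        self.patient_features = patient_features
        self.patient_labels = patient_labels

    def __len__(self):
        # One batch per patient, so an epoch covers every patient exactly once.
        return len(self.patient_features)

    def __getitem__(self, idx):
        # Return all samples of patient `idx` as a single batch.
        return self.patient_features[idx], self.patient_labels[idx]

# Usage (hypothetical): model.fit(PatientBatchGenerator(train_x, train_y), epochs=n_epochs)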
I am training a neural network by SGD (batch size = 1). The inputs are randomly generated, and the labels are calculated based on the inputs. In other words, the data does not have to be realistic, but the relationship between inputs and labels is specific. I will train my NN for only 1 epoch, but with many batches.
I have the following code:
training_input = tf.Variable(tf.zeros(...))
assign_training_input_with_random_values = training_input.assign(tf.random_normal(...))

# Create a session, initialize a bunch of variables, construct a neural network...
for batch in range(batch_number):
    sess.run(assign_training_input_with_random_values)
    # Train my neural network...
However I noticed that if I write the above code differently the speed goes down by a lot:
# Run the assignment operation directly without defining it as a variable
for batch in range(batch_number):
    sess.run(training_input.assign(tf.random_normal(...)))
    # Train my neural network...
The first snippet being significantly faster makes me worry that tensorflow is only randomizing when I define the assign_training_input_with_random_values variable, and the same training examples are fed to the NN over every batch afterwards. In this case, the NN will probably not generalize well. Meanwhile, the second snippet is slow because it is randomizing every batch. Is this actually the case or is there another reason for this?
First, the explanation for your observations.
Computational difference between 1st and 2nd solutions
It makes sense that your first solution is faster than the second. You define the assign operation once and then execute it for 100 epochs. However, in the 2nd solution you create a new op every epoch, growing the computational graph over time, which causes your program to slow down.
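To see that growth concretely, here is a minimal sketch (TF 1.x, with toy shapes and a 3-iteration loop chosen purely for illustration) that counts graph nodes while the assign op is created inside the loop:

import tensorflow as tf

tf.reset_default_graph()
training_input = tf.Variable(tf.zeros(shape=[3, 2]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch in range(3):
        # A fresh random_normal + assign op is added to the graph on every pass,
        # so the node count printed below keeps increasing.
        sess.run(training_input.assign(tf.random_normal([3, 2])))
        print(len(tf.get_default_graph().as_graph_def().node))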
Observation about the 1st solution
(Following @Y.Z.'s finding) Apparently the first solution does evaluate to different random arrays every time you run it. Therefore, the first solution is also valid.
Another way to implement this
The correct way to implement your solution would be to use a tf.placeholder to feed values in every epoch the following way.
import tensorflow as tf
import numpy as np

training_input = tf.Variable(tf.zeros(shape=[3, 2]))
tf_random = tf.placeholder(shape=[3, 2], dtype=tf.float32)
assign_training_input_with_random_values = training_input.assign(tf_random)

# Create a session, initialize a bunch of variables, construct a neural network...
epoch = 0
with tf.Session() as sess:
    # Initialize variables so the assign op can run
    sess.run(tf.global_variables_initializer())
    while epoch < 10:
        epoch += 1
        sess.run(assign_training_input_with_random_values,
                 feed_dict={tf_random: np.random.normal(size=(3, 2))})
Comparing Solution 1 vs My solution
So it turns out that both your first solution and my solution do not grow the graph. If you run the line
print([n.name for n in tf.get_default_graph().as_graph_def().node])
for your first solution and for my solution (be careful to run tf.reset_default_graph() at the beginning), you'll see that the number of tensors remains constant regardless of the number of iterations. It appears that TensorFlow is smart enough to prune old tf.random tensors that are no longer used.
So I know how epochs, train steps, batch sizes and that kind of thing are defined, but it is really hard for me to get my head wrapped around TPU terminology like train loops, iterations per loop and so on. I read this but I'm still confused.
Also, how can I benchmark the time for iterations per loop, for example?
Any explanation would help me a lot there. Thanks!
As the other answers have described, iterations_per_loop is a tuning parameter that controls the amount of work done by the TPU before checking in with it again. A lower number lets you inspect results (and benchmark them) more often, and a higher number reduces the overhead due to synchronization.
This is no different from familiar network or file buffering techniques; changing its value affects performance, but not your final result. In contrast, ML hyperparameters like num_epochs, train_steps, or train_batch_size will change your result.
EDIT: Adding an illustration in pseudocode, below. Notionally, the training loop functions like this:
def process_on_TPU(examples, train_batch_size, iterations_per_loop):
    # The TPU will run `iterations_per_loop` training iterations before returning to the host
    for i in range(0, iterations_per_loop):
        # on every iteration, the TPU will compute `train_batch_size` examples,
        # calculating the gradient from every example in the given batch
        compute(examples[i * train_batch_size : (i + 1) * train_batch_size])

# assume each entry in `examples` is a single training example
for b in range(0, train_steps, train_batch_size * iterations_per_loop):
    process_on_TPU(examples[b:b + train_batch_size * iterations_per_loop],
                   train_batch_size,
                   iterations_per_loop)
From this, it might appear that train_batch_size and iterations_per_loop are simply two different ways of accomplishing the same thing. However, this is not the case; train_batch_size affects the learning rate, since (at least in ResNet-50) the gradient is computed at each iteration from the average of the gradient of every example in the batch. Taking 50 steps per 50k examples will produce a different result from taking 1k steps per 50k examples, since the latter case calculates the gradient much more often.
EDIT 2: Below is a way to visualize what's happening, with a racing metaphor. Think of the TPU as running a race that has a distance of train_steps examples, and its stride lets it cover a batch of examples per step. The race is on a track, which is shorter than the total race distance; the length of the lap is your total number of training examples, and every lap around the track is one epoch. You can think of iterations_per_loop as being the point where the TPU can stop at a "water station" of sorts where the training is temporarily paused for a variety of tasks (benchmarking, checkpointing, other housekeeping).
By "train loop", I'm assuming it's the same meaning as "training loop". The training loop is the one that iterates through each epoch in order to feed the model.
The iterations per loop is related to how Cloud TPU handles the training loop. In order to amortize the TPU launch cost, the model training step is wrapped in a tf.while_loop, such that one Session run actually runs many iterations for a single training loop.
Because of this, Cloud TPU runs a specified number of iterations of the training loop before returning to the host. Therefore, iterations_per_loop is how many iterations will run for one session.run call.
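As a rough illustration of the idea (plain TF 1.x, not the actual TPUEstimator internals; the loop body is just a placeholder), wrapping the per-step work in tf.while_loop lets a single session.run execute many iterations:

import tensorflow as tf

iterations_per_loop = 100

def cond(i, total_loss):
    return i < iterations_per_loop

def body(i, total_loss):
    # Placeholder for one training iteration; a real model would compute its
    # loss here and apply an optimizer update.
    step_loss = tf.constant(1.0)
    return i + 1, total_loss + step_loss

# The whole loop lives in the graph, so one run call executes all iterations
# without returning to the host in between.
_, final_loss = tf.while_loop(cond, body, [tf.constant(0), tf.constant(0.0)])

with tf.Session() as sess:
    sess.run(final_loss)  # one session.run = `iterations_per_loop` iterations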
TPU literally means "Tensor Processing Unit", it's a hardware device used for computation in exactly the same way a GPU is used. The TPUs are effectively Google proprietary GPUs. There are technical differences under the hood of a GPU vs a TPU, mostly regarding speed and power consumption, and some issues of floating point precision, but you don't need to care about the details.
iterations_per_loop appears to be an effort to improve efficiency by loading the TPU with multiple training batches. There are often hardware bandwidth limitations when transferring large amounts of data from main memory to a GPU/TPU.
It appears that the code you reference is passing iterations_per_loop number of training batches to the TPU, then running iterations_per_loop number of training steps before pausing to do another data transfer from main memory to TPU memory.
I'm rather surprised to see that, though; I would expect asynchronous background data transfers to be possible by now.
My only disclaimer is that, while I'm proficient with Tensorflow, and have watched TPU evolution in papers and articles, I'm not directly experienced with the Google API or running on TPUs, so I'm inferring from what I read in the documentation you linked to.
I am using TensorFlow, but I am not sure why I even need the global_step variable, or whether it is even necessary for training. I have something like this:
gradients_and_vars = optimizer.compute_gradients(value)
train_op = optimizer.apply_gradients(gradients_and_vars)
and then in my loop inside a session I do this:
_ = sess.run([train_op])
I am using a Queue to feed my data to the graph. Do I even have to instantiate a global_step variable?
My loop looks like this:
while not coord.should_stop():
So this loop stops when it should stop. Why do I need the global_step at all?
You don't need the global step in all cases. But sometimes people want to stop training, tweak some code, and then continue training with the saved and restored model. Then it is often nice to know how long (i.e. for how many steps) the model has been trained so far. Hence the global step.
Also, your learning rate schedule might depend on how long the model has already been trained. Say you want to decay your learning rate every 100,000 steps: this is hard to do if you interrupted training in between and didn't keep track of the number of steps already taken.
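For example, here is a minimal sketch (TF 1.x) of how a global step is typically wired into a decaying learning rate; the toy loss and the decay values are made up for illustration:

import tensorflow as tf

# Toy model just to have something to optimize; the point is the global_step wiring.
w = tf.Variable(5.0)
loss = tf.square(w)

global_step = tf.train.get_or_create_global_step()

# Decay the learning rate every 100,000 steps (values are illustrative).
learning_rate = tf.train.exponential_decay(
    0.1, global_step, decay_steps=100000, decay_rate=0.96, staircase=True)

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
gradients_and_vars = optimizer.compute_gradients(loss)
# Passing global_step makes every train_op run increment the counter, so the
# schedule (and any restored checkpoint) knows how many steps have been taken.
train_op = optimizer.apply_gradients(gradients_and_vars, global_step=global_step)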
Furthermore, if you are using TensorBoard, the global step is the central parameter for the x-axis of your charts.