TensorFlow Increase Batch Size mid experiment - tensorflow

I am looking to replicate some of the behavior in the paper "Don't Decay the Learning Rate, Increase the Batch Size", and I am wondering if there is a simple approach to increase the batch size within a GCMLE experiment. I have a custom estimator and I am trying to think of ways to adjust the batch size within the experiment. I realize that I could run with one batch size for a certain number of epochs, then load the saved graph and kick off a subsequent experiment, but I am wondering if there are any other options to update the batch size within the same experiment?

Setting up your graph to support a variable batch size is pretty easy: just use None for the first dimension of the shape. Take a look at this article:
Build a graph that works with variable batch size using Tensorflow
Then you can feed in a batch of any size at every sess.run(train_op, feed_dict={X: data, Y: labels}), where the first dimension of X, your batch, is variable length.
It pretty much just works as you'd expect.
Example graph structure with variable batch size:
X = tf.placeholder("float", [None, num_input])
Y = tf.placeholder("float", [None, num_classes])
In general, you're allowed to have 1 unknown dimension in your tensors. Tensorflow will infer that dimension based on the actual data you pass it at runtime.
In this example, in your first iterations your data shape might be [10, 784] (batches of 10), and in later iterations maybe your shape becomes [50, 784] (batches of 50). The rest of your graph setup will work without change.
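A minimal sketch of what that looks like end to end (TF 1.x style; the zero-filled arrays and the toy linear model below are just stand-ins for your real data and graph):

import numpy as np
import tensorflow as tf

num_input, num_classes = 784, 10

X = tf.placeholder("float", [None, num_input])
Y = tf.placeholder("float", [None, num_classes])

# Toy linear model and training op; replace with your own graph.
W = tf.Variable(tf.zeros([num_input, num_classes]))
logits = tf.matmul(X, W)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Early iterations: batches of 10
    sess.run(train_op, feed_dict={X: np.zeros((10, num_input)), Y: np.zeros((10, num_classes))})
    # Later iterations: batches of 50, with no change to the graph
    sess.run(train_op, feed_dict={X: np.zeros((50, num_input)), Y: np.zeros((50, num_classes))})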

One approach is to set delay_workers_by_global_step=True in the constructor to Experiment.
The reason this works is because the effective batch size is batch_size * num_workers. So if you delay the start of other workers, your batch size will gradually increase.
Of course, your throughput will be correspondingly lower in the early phases.
If you directly want to control the batch_size, you will have to effectively replicate the behavior of learn_runner.run in your own code. That wouldn't be too bad, except for the fact that deep down in experiment.py, it starts a server which, AFAICT, cannot be disabled.
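If you do end up sidestepping Experiment, one sketch of a workaround uses the plain tf.estimator API instead: call train() repeatedly against the same model_dir with a schedule of increasing batch sizes, since each call resumes from the latest checkpoint. Here my_model_fn, features and labels are assumed to be your own code and data; the schedule is purely illustrative.

import tensorflow as tf

def make_input_fn(batch_size):
    def input_fn():
        # features and labels are assumed to be your own arrays or tensors
        ds = tf.data.Dataset.from_tensor_slices((features, labels))
        return ds.repeat().shuffle(10000).batch(batch_size)
    return input_fn

estimator = tf.estimator.Estimator(model_fn=my_model_fn, model_dir="gs://my-bucket/model")

# Each train() call resumes from the latest checkpoint in model_dir,
# so the batch size steps up as training progresses.
for batch_size, steps in [(64, 10000), (128, 10000), (256, 10000)]:
    estimator.train(input_fn=make_input_fn(batch_size), steps=steps)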

Related

Loss reduction in tf.distribute.MirroredStrategy()

I'm confused regarding using distributed strategy and the correct way of reduction in loss functions.
I implemented a U-Net using tf.distribute.MirroredStrategy(). Everything works fine using default loss BinaryCrossentropy as follows:
with strategy.scope():
    model = build_network((size, size, 3), num_classes)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
                  loss=tf.keras.losses.BinaryCrossentropy())
However, I want to create custom loss functions. To start with, I wrote a wrapper containing BinaryCrossentropy, to get familiar with the correct way of using the reduction methods. I followed the instructions in https://www.tensorflow.org/tutorials/distribute/custom_training#define_the_loss_function
and used tf.nn.compute_average_loss in order to divide by the global batch_size.
def loss_functions(loss_spec):
    if loss_spec == 'cross_entropy':
        def c_loss(truth, pred):
            my_loss = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)(truth, pred)
            my_loss = tf.math.reduce_mean(my_loss, axis=[1, 2])  # average across the two image dimensions
            my_loss = tf.nn.compute_average_loss(my_loss)  # sums all items and divides by the global batch size
            return my_loss
        return c_loss
which is called in the following way:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
              loss=utils.loss_functions('cross_entropy'))
It also works, but I noticed a difference by a factor of the number of replicas compared to using tf.keras.losses.BinaryCrossentropy() directly. I.e., when using two replicas, BinaryCrossentropy() yields a loss twice as large as my custom loss. Thus, to get the same value, I would need to divide by the per-replica batch size instead of the global batch size, i.e. exactly the way it should NOT be done according to the documentation.
However, the documentation refers to building an own training routine, whereas I am using model.compile() and model.fit() methods.
Can anybody explain this behaviour to me?
UPDATE:
Using tf.nn.compute_average_loss, or any reduction over the batch axis, is not needed at all with model.compile() and model.fit(): the reduction and scaling are done automatically. However, I still do not know how model.fit() handles this internally.
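For reference, a minimal sketch of the variant described in this update (reusing build_network, size, num_classes and args from the snippets above, which are assumed to exist):

import tensorflow as tf

def c_loss(truth, pred):
    # Per-pixel loss; no reduction over the batch axis here.
    per_pixel = tf.keras.losses.BinaryCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)(truth, pred)
    # Average over the two image dimensions only; model.fit() handles the
    # batch reduction and the scaling across replicas.
    return tf.math.reduce_mean(per_pixel, axis=[1, 2])

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_network((size, size, 3), num_classes)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
                  loss=c_loss)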
Thanks and cheers, everybody

Validation Loss doesn't Change

I am using a custom loss to try to decrease the peak-to-average power ratio (PAPR) of OFDM symbols. To break it down: the input is a sequence of length N whose elements can take only 4 values. The output can take any floating-point value in [-1, 1] (because I can't go over the power threshold). I generate the training and validation sets randomly, since the data can be any random combination of the 4 values.
The problem is that changing and tweaking the model and its parameters only improves the training loss; the validation loss is constant from the first epoch.
My custom loss function concatenates the output of the model with the input, spreading it across reserved positions, applies an IFFT, and then computes the max/mean ratio over all elements.
In short, it reserves some of the array elements (tones) whose values the model picks, sacrificing those elements in order to cancel the peaks of the input and get fewer peaks in the final signal.
I am sending the input data one-hot encoded for each of the 4 values, and sending it once more as labels in complex form so I can operate on it in the custom loss function below.
def PAPR_Loss(y_true, y_pred):
    # Boundaries of the reserved tones; L (the number of reserved tones) is assumed to be defined elsewhere.
    Reserved_phases = [0, 32, 62, 93, 124, 155, 186, 217, 248]
    data = tf.concat([tf.concat([y_true[:, Reserved_phases[i]:Reserved_phases[i + 1]],
                                 tf.complex(y_pred[:, 4 * (i + 1) - 4] - y_pred[:, 4 * (i + 1) - 2],
                                            y_pred[:, 4 * (i + 1) - 3] - y_pred[:, 4 * (i + 1) - 1])[:, tf.newaxis]],
                                1) for i in range(L)], 1)
    x = tf.signal.ifft(data)
    temp = tf.square(tf.abs(x))
    loss = tf.reduce_max(temp, axis=-1) / tf.reduce_mean(temp, axis=-1)
    return 10 * tf.experimental.numpy.log10(loss)
[Figure: training loss and validation loss vs. epochs]
I am using 80k unique data combinations for training and 20k different combinations for validation.
I am also using dropout after each layer, so I don't think it's an overfitting problem.
When I remove the tanh activation at the output (meaning the output can take any values), I start getting improvements on the validation loss and a better training loss as well. But I suspect this happens because the model just inflates the mean power term, which is inversely proportional to the loss; it doesn't learn where the peaks are or how to cancel them, it just increases the mean as much as possible so that the max isn't that big relative to it anymore.
Also, could the model fail to train because of the concatenation, and because I use the input, in a different form, as a label? I thought I could get away with this since the input isn't trainable, so it shouldn't matter.
Note: the model doesn't even beat the classical (non-deep-learning) method, which just searches a limited candidate set for the best combinations that reduce the peaks. The problem with the classical method is that it is computationally expensive; if I can even match its performance, this approach will be very rewarding.
What could be going wrong here? What can I try changing next?
Thanks in advance.

batch size in model.fit and model.predict

In Keras, both model.fit and model.predict have a batch_size parameter. My understanding is that the batch size in model.fit relates to batch optimization; what is the physical meaning of batch_size in model.predict? Does it need to be equal to the one used by model.fit?
No, it doesn't. Imagine there is a function inside your model which increases the amount of memory used significantly. Because of that, you might run into resource errors if you try to predict all your data in one go. This is often the case when you use a GPU with limited memory for prediction. So instead you choose to predict only small batches at a time. The batch_size parameter in the predict function will not alter your results in any way, so you can choose any batch_size you want for prediction.
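As a small illustration (a toy model with made-up data, not tied to the question's setup), you can train with one batch size and predict with another; only the memory/throughput trade-off changes, not the predictions:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer="adam", loss="mse")

x_train, y_train = np.random.rand(1000, 10), np.random.rand(1000, 1)
model.fit(x_train, y_train, batch_size=128, epochs=1)

x_test = np.random.rand(500, 10)
# Smaller prediction batches just trade throughput for memory;
# the predicted values are the same regardless of batch_size.
preds = model.predict(x_test, batch_size=32)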
It depends on your model and whether the batch size when training must match the batch size when predicting. For example, if you're using a stateful LSTM then the batch size matters because the entire sequence of data is spread across multiple batches, i.e. it's one long sequence that transcends the batches. In that case the batch size used to predict should match the batch size when training because it's important they match in order to define the whole length of the sequence. In stateless LSTM, or regular feed-forward perceptron models the batch size doesn't need to match, and you actually don't need to specify it for predict().
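A sketch of the stateful case described above (the layer sizes are made up for illustration): batch_input_shape bakes the batch size into the model, so prediction must use the same batch size.

import tensorflow as tf

timesteps, features = 20, 8
model = tf.keras.Sequential([
    # batch_input_shape fixes the batch size at 32 for this model
    tf.keras.layers.LSTM(64, stateful=True, batch_input_shape=(32, timesteps, features)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# Both model.fit(...) and model.predict(...) now need batches of exactly 32 samples.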
Just to add: this is different from predict_on_batch(), where you supply a batch of input samples and get an equal number of prediction outputs. So if you create a batch of 100 samples and submit it to predict_on_batch(), you get 100 predictions, i.e. one for each sample. This can have performance benefits over issuing them one at a time to predict().
As said above, the batch size just sets the number of training samples that are fed in at one go (a batch). Increasing it raises the chance of running out of resources, assuming you are running on your personal computer. If you are running it on the cloud with more resources, you should be fine. You can adjust the number as you want, but don't jump straight to a big number; I suggest increasing it gradually. Also, you may want to read this before you increase your batch size:
https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network

fit_generator in keras: where is the batch_size specified?

Hi, I don't understand the Keras fit_generator docs.
I hope my confusion is rational.
There is a batch_size and also the concept of training in batches. Using model.fit(), I specify a batch_size of 128.
To me this means that my dataset will be fed in 128 samples at a time, thereby greatly alleviating memory. It should allow a 100 million sample dataset to be trained as long as I have got the time to wait. After all, keras is only "working with" 128 samples at a time. Right?
But I highly suspect that specifying the batch_size alone doesn't do what I want at all. Tons of memory is still being used. For my goals I need to train in batches of 128 examples each.
So I am guessing this is what fit_generator does. I really want to ask: why doesn't batch_size actually work as its name suggests?
More importantly, if fit_generator is needed, where do I specify the batch_size? The docs say to loop indefinitely.
A generator loops over every row once. How do I loop over 128 samples at a time, remember where I last stopped, and recall it the next time Keras asks for the next batch's starting row number (which would be row 129 after the first batch is done)?
You will need to handle the batch size somehow inside the generator. Here is an example that generates random batches:
import numpy as np

data = np.arange(100)
data_lab = data % 2
wholeData = np.array([data, data_lab])
wholeData = wholeData.T

def data_generator(all_data, batch_size=20):
    while True:
        idx = np.random.randint(len(all_data), size=batch_size)
        # Assuming the last column contains labels
        batch_x = all_data[idx, :-1]
        batch_y = all_data[idx, -1]
        # Yield a tuple of (Xs, Ys) to feed the model
        yield (batch_x, batch_y)

print(next(data_generator(wholeData)))  # the generator is infinite, so take one batch rather than listing it
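To connect this back to the question: the batch size is whatever you pass to the generator, and steps_per_epoch tells Keras how many of those batches make up one epoch. A usage sketch, assuming model is a compiled Keras model that accepts the single feature column yielded above:

model.fit_generator(data_generator(wholeData, batch_size=20),
                    steps_per_epoch=len(wholeData) // 20,
                    epochs=10)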
First, the Keras batch_size argument does work as intended. If you are working on a GPU, you should know that the model can be very memory-heavy with Keras, especially if you are using recurrent cells. If you are working on a CPU, the whole program is loaded into memory, so the batch size won't have much of an impact on memory usage. If you are using fit(), the whole dataset is probably loaded into memory, and Keras produces batches at every step. It's very difficult to predict the amount of memory that will be used.
As for the fit_generator() method, you should build a Python generator function (using yield instead of return), yielding one batch at every step. The yield should be inside an infinite loop (we often use while True: ...).
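If you want batches that walk through the data in order and remember where they stopped (rather than the random batches in the example above), a minimal sketch looks like this:

import numpy as np

def sequential_generator(features, labels, batch_size=128):
    start = 0
    while True:  # Keras expects the generator to loop forever
        stop = start + batch_size
        batch_x = features[start:stop]
        batch_y = labels[start:stop]
        # Remember the position for the next call; wrap around at the end.
        start = stop if stop < len(features) else 0
        yield batch_x, batch_y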
Do you have some code to illustrate your problem?

What is the best way to run saved model with different batch size in TensorFlow?

I trained the CIFAR-10 example model from TensorFlow's repository with batch_size 128, and it worked fine. Then I froze the graph and managed to run it from C++, just like they do in their C++ label_image example.
The only problem was that I had to artificially generate a tensor of shape [128, image_height, image_width, channels] to classify a single image from C++, because the saved model expects an input of 128 samples per batch, since that is the number of samples that comes from the queue.
I tried training the CIFAR-10 example with batch_size = 1, and then I managed to classify examples one by one when running the model from C++, but that doesn't seem like a great solution. I also tried manually changing the tensor shapes in the saved graph file, but it didn't work.
My question is: what is the best way to train a model with a fixed batch size (like 32, 64, 128, etc.) and then save it so that it can be used with an arbitrary batch size? If that's not possible, how should I save the model so that it can classify samples one by one?
It sounds like the problem is that TensorFlow is "baking in" the batch size to other tensors in the graph (e.g. if the graph contains tf.shape(t) for some tensor t whose shape depends on the batch size, the batch size might be stored in the graph as a constant). The solution is to change your program slightly so that tf.train.batch() returns tensors with a variable batch size.
The tf.train.batch() method accepts a tf.Tensor for the batch_size argument. Perhaps the simplest way to modify your program for variable-sized batches would be to define a placeholder for the batch size:
# Define a scalar tensor for the batch size, so that you can alter it at
# Session.run()-time.
batch_size_tensor = tf.placeholder(tf.int32, shape=[])
input_tensors = tf.train.batch(..., batch_size=batch_size_tensor, ...)
This would prevent the batch size from being baked into your GraphDef, so you should be able to feed values of any batch size in C++. However, this modification would require you to feed a value for the batch size on every step, which is slightly tedious.
Assuming that you always want to train with batch size 128, but retain the flexibility to change the batch size later, you could use a tf.placeholder_with_default() to specify that the batch size should be 128 when you don't feed an alternative value:
# Define a scalar tensor for the batch size, so that you can alter it at
# Session.run()-time.
batch_size_tensor = tf.placeholder_with_default(128, shape=[])
input_tensors = tf.train.batch(..., batch_size=batch_size_tensor, ...)
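Usage would then look roughly like the following sketch (train_op and predictions stand in for whatever ops your own graph defines; queue-runner setup is omitted): feed nothing to get the default of 128, and feed an explicit value only when you want something else.

# (Queue runners etc. omitted; train_op and predictions stand in for your own ops.)
sess.run(train_op)                                        # uses the default batch size of 128
sess.run(predictions, feed_dict={batch_size_tensor: 1})   # e.g. single-example inference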
Is there a reason you need fixed batch size in the graph?
I think a good way is to build a graph with a variable batch size - by putting None as the first dimension. During training, you can then pass the batch size flag to your data provider, so it feeds the desired amount of data in each iteration.
After the model is trained, you can export the graph using tf.train.Saver(), which exports the metagraph. To do inference, you can load the exported files and just evaluate with any number of examples - also just one.
Note, this is different from the frozen graph.
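A rough sketch of that flow (build_model and the checkpoint path are stand-ins for your own code and files):

import numpy as np
import tensorflow as tf

image_height, image_width = 32, 32
images = tf.placeholder(tf.float32, [None, image_height, image_width, 3], name="images")
logits = build_model(images)           # graph built with a variable first dimension
saver = tf.train.Saver()

with tf.Session() as sess:
    saver.restore(sess, "/tmp/cifar10_model/model.ckpt")
    # Any batch size works at inference time, including a single image.
    single_pred = sess.run(logits, feed_dict={images: np.zeros((1, 32, 32, 3), np.float32)})
    batch_pred = sess.run(logits, feed_dict={images: np.zeros((64, 32, 32, 3), np.float32)})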