How to partition a single batch into many invocations to save memory - tensorflow

I have a fairly large model that can only be trained on the GPU with a small batch size, but I need to use a larger batch size (from other experiments, I know this improves final accuracy and convergence time).
Caffe provides a nice solution to this problem through the 'iter_size' option, which splits a batch into n smaller batches, accumulates the n gradients, and then updates once.
How can this be implemented efficiently in TensorFlow?

You could use smaller batches, compute the gradients manually, and then add them up and apply them all at once. For example, if you want a batch size of 100, compute gradients for 10 batches of 10, then add the gradients and apply them.
You can use the tf.gradients() op to compute the gradients for each small batch separately and add them. If the loss is a mean over the batch, divide the summed gradients by the number of small batches so that the result matches the gradient of the full batch. Then use the apply_gradients() method of whichever optimizer you want to perform the training step.
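The accumulate-then-apply idea can be sketched without any framework. The snippet below is only an illustration under simple assumptions: a one-parameter model y = w * x with a mean squared error loss, and hypothetical helper names (mse_grad, accumulated_grad) that do not come from TensorFlow. In TensorFlow, the per-batch gradient would come from tf.gradients() and the final step from apply_gradients().

```python
# Framework-free sketch of gradient accumulation for the model y = w * x.
# mse_grad and accumulated_grad are hypothetical helpers, not TensorFlow APIs.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over the given samples."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, weighting each by its share of the batch."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        total += mse_grad(w, mx, my) * len(mx) / n
    return total

xs = [float(i) for i in range(100)]   # one "large" batch of 100 samples
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5
lr = 1e-5

full = mse_grad(w, xs, ys)                              # needs all 100 samples at once
acc = accumulated_grad(w, xs, ys, micro_batch_size=10)  # 10 passes over 10 samples each
w_new = w - lr * acc                                    # a single update after accumulation
```

Only one micro-batch has to be held in memory at a time, while the update matches what the full batch of 100 would have produced (up to floating-point rounding).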

Related

Why does a TensorFlow dataset need to be batched before fit?

If we've made a TensorFlow dataset (for example with from_tensor_slices), we need to call the .batch(...) method before passing the dataset to fit(). Why does fit() expect the dataset to be batched?
Datasets are batched for the following reasons:
To avoid high memory usage: using the entire dataset as one batch might cause out-of-memory problems.
Computation is faster when training is done in batches than when samples are processed one at a time.
Weights and biases are updated once per batch, so training in batches gives many updates per pass over the data.
Note also that fit() treats each element of a dataset as one batch, and its batch_size argument is not meant to be used with dataset inputs, so the dataset has to supply the batch dimension itself.
Reference:
https://medium.com/#elimu.michael9/understanding-epochs-and-batches-23120a04b3cb
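As a rough illustration of what batching does to a dataset, the sketch below slices a list of samples into consecutive batches and counts the resulting weight updates per epoch. The batches helper is hypothetical and only mimics the default behaviour of .batch(...), where the last batch may be smaller than the others.

```python
import math

def batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))
chunks = list(batches(samples, 4))
# chunks is [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# One weight update per batch, so this is the number of updates per epoch.
steps_per_epoch = math.ceil(len(samples) / 4)
```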

Why is batch size allocated in GPU?

Given a Keras model (on Colab) that has input shape (None,256,256,3) and batch_size is 16 then the memory allocated for that input shape is 16*256*256*3*datatype (datatype=2,4,8 depending on float16/32/64). This is how it works. My confusion is that for a given batch_size (=16) 1*256*256*3 could have been allocated and the 16 images could have been passed one by one and the final gradient could have been averaged.
1) So, is the allocation dependent on batch size so that 'batch_size' computations can be done in parallel and the configuration that I have mentioned above (1*256*256*3) would be serializing and hence defeating the purpose of GPU?
2) Would the same type of allocation happen on CPU for parallel computation (if the answer to 1) is yes)?
In general, batch size is a hyperparameter you need to tune.
As for your question: when you train in batches, you are generally running a generator object that loads one batch of data, performs a gradient descent step on it, and then moves on to the next batch.
Mini-batch gradient descent is usually preferred because it converges faster than full-batch gradient descent.
As you increase the batch size, more training examples are loaded at once, which increases the memory allocated.
So yes, the allocation depends on the batch size so that the whole batch can be computed in parallel. Passing the images one at a time and averaging the gradients would amount to the same computation overall, but it would serialize the work.
The same applies on a CPU with multiple cores. A GPU is usually preferred because the work under the hood is heavy: operating on n-dimensional matrices, calculating partial derivatives, computing the loss, and updating the weight values.
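The input-memory arithmetic from the question can be checked directly. This sketch covers only the input tensor; the activations of each layer also scale with the batch size and usually take far more memory. The input_bytes helper is a hypothetical name used for illustration.

```python
def input_bytes(batch_size, height, width, channels, bytes_per_element):
    """Bytes needed to hold one input batch of the given shape and dtype size."""
    return batch_size * height * width * channels * bytes_per_element

for name, size in (("float16", 2), ("float32", 4), ("float64", 8)):
    mib = input_bytes(16, 256, 256, 3, size) / 2**20
    print(name, mib, "MiB")   # float16 6.0, float32 12.0, float64 24.0

# A single float32 image, as in the 1*256*256*3 alternative from the question:
single_image_mib = input_bytes(1, 256, 256, 3, 4) / 2**20   # 0.75 MiB
```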

Why does Keras accept a batch size option for model.evaluate?

Why does the evaluate function of the Keras API in Tensorflow accept a batch_size? To my knowledge, this parameter should only be relevant for managing how many samples we use per iteration during training. What influence does this choice have during model evaluation?
Batch size matters mainly for sequence-based or time series predictions.
Below are the cases where you have to pay attention to the batch size during prediction:
In time series use cases, it may be desirable to use a large batch size when training the network and a batch size of 1 when making predictions, in order to predict the next step in the sequence.
For a stateful RNN, you must provide a fixed batch size during prediction/evaluation, because the output state of the current batch is used as the initial state for the next batch. Such models keep information from one batch to the next.
If your model doesn't fall into one of these categories, you technically don't need to provide a batch size during evaluation. If you do provide one, it only controls how much data is fed to the GPU at a time.
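For a stateless model, evaluating in batches only changes how the data is chunked, not the resulting metric, as long as the per-batch means are weighted by batch size. The following framework-free sketch assumes the per-sample losses are already known; evaluate_in_batches is a hypothetical helper, not the Keras implementation.

```python
def mean(values):
    return sum(values) / len(values)

def evaluate_in_batches(losses, batch_size):
    """Combine per-batch mean losses into an overall mean, weighted by batch size."""
    total, count = 0.0, 0
    for start in range(0, len(losses), batch_size):
        chunk = losses[start:start + batch_size]
        total += mean(chunk) * len(chunk)
        count += len(chunk)
    return total / count

per_sample_losses = [0.5, 1.5, 2.0, 4.0, 1.0, 3.0, 2.5]
overall = mean(per_sample_losses)

# The result is the same for every batch size, up to floating-point rounding.
results = [evaluate_in_batches(per_sample_losses, bs) for bs in (1, 2, 3, 7)]
```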

batch size in model.fit and model.predict

In Keras, both model.fit and model.predict have a batch_size parameter. My understanding is that the batch size in model.fit is related to batch optimization. What is the physical meaning of batch_size in model.predict? Does it need to be equal to the one used by model.fit?
No, it doesn't. Imagine that inside your model there is a function which increases the amount of memory significantly. You might then run into resource errors if you try to predict all your data in one go. This is often the case when you predict on a GPU with limited memory. So instead you choose to predict only a small batch at a time. The batch_size parameter of the predict function will not alter your results in any way, so you can choose any batch_size you want for prediction.
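The claim that the prediction batch size does not change the results can be illustrated with a toy stateless model, where each output depends only on its own input row. This is a sketch with hypothetical helpers (predict_batch, predict), not the Keras predict() implementation, and it does not cover stateful models.

```python
def predict_batch(weights, batch):
    """Stateless toy model: each output is a dot product of weights and one input row."""
    return [sum(w * x for w, x in zip(weights, row)) for row in batch]

def predict(weights, inputs, batch_size):
    """Run the model over the inputs one batch at a time and concatenate the outputs."""
    outputs = []
    for start in range(0, len(inputs), batch_size):
        outputs.extend(predict_batch(weights, inputs[start:start + batch_size]))
    return outputs

weights = [0.5, -1.0, 2.0]
inputs = [[float(i), float(i + 1), float(i + 2)] for i in range(8)]

one_by_one = predict(weights, inputs, batch_size=1)
in_threes = predict(weights, inputs, batch_size=3)
all_at_once = predict(weights, inputs, batch_size=8)
```

All three calls return the same list; only the amount of data processed per step differs.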
It depends on your model and whether the batch size when training must match the batch size when predicting. For example, if you're using a stateful LSTM then the batch size matters because the entire sequence of data is spread across multiple batches, i.e. it's one long sequence that transcends the batches. In that case the batch size used to predict should match the batch size when training because it's important they match in order to define the whole length of the sequence. In stateless LSTM, or regular feed-forward perceptron models the batch size doesn't need to match, and you actually don't need to specify it for predict().
Just to add: this is different from predict_on_batch(), where you supply a batch of input samples and get an equal number of prediction outputs. So if you create a batch of 100 samples and submit it to predict_on_batch(), you get 100 predictions, i.e. one for each sample. This can have performance benefits over issuing one sample at a time to predict().
As said above, the batch size just sets how much data is fed in at one go. Increasing it increases the chance of running out of resources, especially on a personal computer; on a cloud machine with more resources you should be fine. You can adjust the number as you like, but don't start with a big number; I suggest going up slowly. Also, you may want to read this before you increase your batch size:
https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network

Tensorflow optimizers: loss sum vs mean

I'm wondering if the Tensorflow optimizers (in particular the AdamOptimizer) have a preference when it comes to defining a loss function as a sum or as a mean/average over a minibatch?
In general, my assumption was that using the mean is preferred, because the loss then does not depend on the size of the mini-batches. Thus, it is easier to find a learning rate that works with any batch size.
However, Tensorflow defines e.g. l2_loss internally as:
output = sum(t ** 2) / 2
Does this imply that the optimizers account for the batch size internally already, i.e., they expect losses to scale linearly with the batch size? Also, what's the motivation of taking half the L2 norm from the perspective of optimization?
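Here l2_loss is actually a regularization loss function. We add it to the main loss function in order to prevent the parameters from overfitting. We normally divide the L2 loss by 2 to make the gradient simpler: the derivative of t ** 2 / 2 is just t.
The optimizers themselves do not account for the batch size; they minimize whatever scalar loss you give them. Averaging over the mini-batch is normally done when the data loss is defined (for example, with a mean reduction), while l2_loss is applied to the weights, whose size does not depend on the batch size.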
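To see why a mean-reduced loss is easier to tune across batch sizes, as the question assumes, compare the gradients of a summed and an averaged squared error for the toy model y = w * x. The helpers grad_sum and grad_mean are hypothetical names, and the larger batch is simply the same data repeated four times.

```python
def grad_sum(w, xs, ys):
    """Gradient of sum((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))

def grad_mean(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return grad_sum(w, xs, ys) / len(xs)

base_x, base_y = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
big_x, big_y = base_x * 4, base_y * 4   # same data, 4x the batch size
w = 0.0

sum_small, sum_big = grad_sum(w, base_x, base_y), grad_sum(w, big_x, big_y)
mean_small, mean_big = grad_mean(w, base_x, base_y), grad_mean(w, big_x, big_y)
# sum_small is -120.0 and sum_big is -480.0: the step size grows with the batch.
# mean_small and mean_big are both -30.0: the same learning rate still fits.
```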