batch size in model.fit and model.predict - tensorflow

In Keras, both model.fit and model.predict have a batch_size parameter. My understanding is that the batch size in model.fit is related to batch optimization, but what is the physical meaning of batch_size in model.predict? Does it need to be equal to the one used by model.fit?

No, it doesn't. Imagine that some operation inside your model significantly increases memory usage. You might then run into resource errors if you try to predict all your data in one go; this is often the case when you predict on a GPU with limited memory. So instead you predict only small batches at a time. The batch_size parameter in the predict function does not alter your results in any way, so you can choose any batch_size you want for prediction.
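To make this concrete, here is a minimal sketch with a made-up toy model and random data (names and shapes are illustrative, not from the question); the only point is that fit() and predict() can use different batch sizes without changing the predictions:

import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

x_train = np.random.rand(1000, 8)
y_train = np.random.randint(0, 2, size=(1000, 1))

model.fit(x_train, y_train, epochs=1, batch_size=32)   # batch size used for optimization
preds = model.predict(x_train, batch_size=256)         # only controls how much is processed per step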

It depends on your model and on whether the batch size used when training must match the batch size used when predicting. For example, if you're using a stateful LSTM then the batch size matters, because the entire sequence of data is spread across multiple batches, i.e. it's one long sequence that transcends batch boundaries. In that case the batch size used to predict should match the batch size used when training, because the two must agree in order to define the whole length of the sequence. With a stateless LSTM, or with regular feed-forward perceptron models, the batch sizes don't need to match, and you actually don't need to specify a batch size for predict() at all.
Just to add: this is different from predict_on_batch(), where you supply a batch of input samples and get an equal number of prediction outputs. So if you create a batch of 100 samples and submit it to predict_on_batch(), you get 100 predictions, i.e. one for each sample. This can have performance benefits over issuing samples one at a time to predict().
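For illustration, a small sketch of the stateful case described above (the dimensions are made up): a stateful LSTM is built with a fixed batch size via batch_input_shape, and predict() then expects batches of exactly that size, whereas a stateless LSTM has no such restriction.

from tensorflow import keras

batch_size, timesteps, features = 32, 10, 6

stateful_model = keras.Sequential([
    keras.layers.LSTM(64, stateful=True,
                      batch_input_shape=(batch_size, timesteps, features)),
    keras.layers.Dense(1),
])
# predict() on this model must be fed batches of exactly batch_size samples;
# with stateful=False the prediction batch size is free to differ from training.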

As said above, the batch size just controls how much data is fed through the model in one go (a batch). Increasing it raises the chance of exhausting your computer's resources, assuming you are running on your personal machine; if you are running in the cloud with more resources, you should be fine. You can adjust the number as you want, but don't jump straight to a big value; I suggest increasing it gradually. Also, you may want to read this before you increase your batch size:
https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network

Related

Batch normalisation during testing

I am working on a 2D time-series problem (input vectors of size 140*6) for binary classification using a CNN. I have not used any scaling or normalising techniques; instead I fed the data directly into a CNN with 3 hidden layers and Batch Normalisation layers, trained with batch size 256. Since I also have to test it in real time with batch size 1, how would batch normalisation work then, given that no mean or standard deviation has been calculated for any training layer? And should batch normalisation also be used in the forward pass during final testing, or should the mean and standard deviation be calculated only for the training layers and then reused?
Batch normalization is not used during testing. The reason is that batch normalization is there to alleviate the problem of covariate shift between different batches of the training data. That covariate shift leads to badly trained models, which is why we use it; it has no role to play during testing.
And if you have used batch normalization with a batch size of 1, then that is simply instance normalization.
This question was asked two years ago, but I don't think the accepted answer is correct! Batch normalization IS used during testing (at least, you keep the batch normalisation layers), but with the running averages of mean and variance saved from the training data. So it is not actual batch normalisation during testing, but rather a linear transformation using the saved training statistics. Therefore, if you are testing with a batch size of 1, you simply use the saved running averages from the training data.
The following thread answers the question: Batch normalization during testing
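As a small sketch of that behaviour (using a Keras BatchNormalization layer and random data purely for illustration): called with training=False, the layer does not compute batch statistics but applies a linear transformation using the moving mean and variance accumulated during training, which is why a batch size of 1 works at test time.

import numpy as np
from tensorflow import keras

bn = keras.layers.BatchNormalization()
x = np.random.rand(4, 8).astype("float32")

_ = bn(x, training=True)            # updates moving_mean / moving_variance
y = bn(x, training=False)           # uses the saved running averages
single = bn(x[:1], training=False)  # works even with a single sample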

Why does Keras accept a batch size option for model.evaluate?

Why does the evaluate function of the Keras API in Tensorflow accept a batch_size? To my knowledge, this parameter should only be relevant for managing how many samples we use per iteration during training. What influence does this choice have during model evaluation?
Batch size is mainly relevant for sequence-based or time-series predictions.
Below are the cases where you have to set the batch size for prediction.
In Time Series use cases it may be desirable to use a large batch size when training the network and a batch size of 1 when making predictions in order to predict the next step in the sequence.
For a stateful RNN it is required to provide a fixed batch size during prediction/evaluation, because the output state of the current batch is used as the initial state for the next batch; the network carries information from one batch to the next.
If your model doesn't fall into these categories, you technically don't need to provide a batch size when evaluating. Even if you do provide one, it only determines how much data is fed to the GPU at a time.
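As a sketch (the model, x_test and y_test are assumed to exist already and are not from the question): the batch_size argument to evaluate() only chunks the data fed to the device per step, so for an ordinary stateless model the reported metrics should not depend on it.

loss_small = model.evaluate(x_test, y_test, batch_size=32)
loss_large = model.evaluate(x_test, y_test, batch_size=512)
# Both calls should report the same loss/metrics for a stateless model;
# a stateful RNN instead requires the fixed batch size it was built with.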

Large trainable embedding layer slows down training

I am training a network to classify text with an LSTM. I use a randomly initialized and trainable embedding layer for the word inputs. The network is trained with the Adam optimizer and the words are fed into the network with a one-hot encoding.
I noticed that the number of words represented in the embedding layer heavily influences the training time, but I don't understand why. Increasing the number of words in the network from 200'000 to 2'000'000 almost doubled the time for a training epoch.
Shouldn't training only update the weights that were used during the prediction of the current data point? If my input sequences always have the same length, the same number of updates should happen each step, regardless of the size of the embedding layer.
The number of updates needed would be reflected in the number of epochs it takes to reach a certain precision.
If your observation is that convergence takes the same number of epochs, but each epoch takes twice as much wall-clock time, then it's an indication that simply performing the embedding lookup (and writing the update back to the embedding table) now takes a significant part of your training time.
Which could easily be the case: 2'000'000 words times 4 bytes per float32 times the length of your embedding vector (what is it? let's assume 200) is something like 1.6 gigabytes of data that needs to be touched every minibatch. You also aren't saying how you're training this (CPU, GPU, which GPU), which has a meaningful impact, e.g. because of cache effects: on a CPU, doing the exact same number of reads/writes in a slightly less cache-friendly manner (more sparsity) can easily double the execution time.
Also, your premise is a bit unusual. How much labeled data do you have with enough examples of the 2,000,000th-rarest word to calculate a meaningful embedding for it directly? It's probably possible, but it would be unusual; in pretty much all datasets, including very large ones, the 2,000,000th word would be a nonce word, and it would be harmful to include it in trainable embeddings. The usual approach is to calculate large embeddings separately from large unlabeled data and use them as a fixed, untrainable layer, possibly concatenated with small trainable embeddings learned from the labeled data to capture things like domain-specific terminology.
If I understand correctly, your network maps one-hot vectors representing words to embeddings of some size embedding_size, and the embeddings are then fed as input to an LSTM. The trainable variables of the network are both those of the embedding layer and those of the LSTM itself.
You are correct regarding the update of the weights in the embedding layer. However, the number of weights in one LSTM cell depends on the size of the embedding. Look, for example, at the equation for the forget gate at time step t,
f_t = sigmoid(W_f * x_t + U_f * h_{t-1} + b_f):
the weight matrix W_f is multiplied by the input x_t, which means that one of the dimensions of W_f must be exactly embedding_size. So as embedding_size grows, so does the network size, and it takes longer to train.
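A quick way to check this claim (a sketch, assuming a standard Keras LSTM with 128 units; the sizes are illustrative) is to count the LSTM's own parameters for two embedding sizes:

from tensorflow import keras

def lstm_param_count(embedding_size, units=128):
    # 4 gates, each with a kernel (embedding_size, units),
    # a recurrent kernel (units, units) and a bias (units,)
    lstm = keras.layers.LSTM(units)
    lstm.build((None, None, embedding_size))
    return lstm.count_params()

print(lstm_param_count(100))  # 4 * (100*128 + 128*128 + 128) = 117,248
print(lstm_param_count(200))  # 4 * (200*128 + 128*128 + 128) = 168,448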

Does batch normalization work on balanced dataset?

I trained a classification network using TensorFlow with batch normalization in every convolutional layer. When I predict on a balanced test set in which every category is included, the accuracy is normal. However, if I choose any single category from the test set, the accuracy is low, even zero.
But when 3 categories are included in the test set, the accuracy becomes higher. As we all know, the weights are fixed once the model has finished training, yet I find that the class balance of the test set has a great influence on prediction accuracy.
I suspected that batch normalization had an influence on this, so I removed all batch normalization and retrained the model. This time, when I predict on pictures of only one category, the accuracy is normal.
Does anyone know why? Thanks!
You're right. If your training set is unbalanced you compute and accumulate mean values (for every layer) that are skewed in favor of the majority class.
In fact, you're not "normalizing" but instead, you're making the unbalancing problem worse.
Use batch normalization when you have a balanced training set and you can be sure that your batches will contain a balanced number of samples. This gives you optimal results.
However, since you added in the comments that you're using tf.contrib.layers.conv2d(x, num_outputs, kernel_size, stride, padding, activation_fn, normalizer_fn=tf.contrib.layers.batch_norm),
I spotted the problem: normalizer_fn calls the function you pass (batch_norm), but it uses the default parameters. By default, is_training is True, so you are computing the batch mean and variance even during the test phase. Read the documentation of tf.contrib.layers.conv2d carefully and use normalizer_params to pass is_training=True when training and is_training=False when testing/validating.
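A sketch of the fix (TF 1.x / tf.contrib style; the input tensor x and the placeholder name are assumptions for illustration):

import tensorflow as tf

is_training = tf.placeholder(tf.bool, name="is_training")

net = tf.contrib.layers.conv2d(
    x, num_outputs=64, kernel_size=3, stride=1, padding="SAME",
    activation_fn=tf.nn.relu,
    normalizer_fn=tf.contrib.layers.batch_norm,
    normalizer_params={"is_training": is_training})
# Feed is_training=True while training and is_training=False at test time,
# so batch norm switches from batch statistics to the stored moving averages.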

How to partition a single batch into many invocations to save memory

I have a somewhat big model that can only be trained on the GPU with a small batch size, but I need to use a larger batch size (from other experiments, I know this improves final accuracy and convergence time).
Caffe provides a nice solution to this problem through the iter_size option, which splits a batch into n smaller batches, accumulates n gradients, and then updates once.
How can this be implemented efficiently in TensorFlow?
You could use smaller batches, compute the gradients manually, and then add them up and apply them at once. For example, if you want a batch size of 100, compute gradients for 10 batches of 10, then add the gradients and apply them. This is explained here.
You can use the tf.gradients() op to compute the gradients for each batch separately and add them. Then use the apply_gradients() method on whatever optimizer you want to perform the training step.
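For example, here is a rough sketch of that accumulate-then-apply pattern in TF 1.x graph style (the loss tensor and the numbers are placeholders, not from the question):

import tensorflow as tf

opt = tf.train.AdamOptimizer(1e-3)
tvars = tf.trainable_variables()

# one gradient accumulator per trainable variable
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
zero_op = [a.assign(tf.zeros_like(a)) for a in accum]

grads = tf.gradients(loss, tvars)                  # gradients of the current small batch
accum_op = [a.assign_add(g) for a, g in zip(accum, grads)]

n_small_batches = 10
apply_op = opt.apply_gradients(
    [(a / n_small_batches, v) for a, v in zip(accum, tvars)])

# Per effective batch: run zero_op once, run accum_op for each of the
# n_small_batches small batches, then run apply_op once to take a single
# optimizer step with the averaged gradient.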