Keras and Recurrent Neural Networks: What does the input look like for batches of univariate data?

When training an RNN, we often specify batches to increase training speed. But what does the input for a batch actually look like? Suppose, for example, I have 1000 observations of just temperatures and choose a batch size of 100, and say we have 4 nodes in a hidden layer.
Is the input simply each temperature on its own? I.e., send temperature 1 through the model, then temperature 2, then temperature 3, and so on until we reach 100, and then update the weights?
Or are all of the temperatures in the batch positioned as inputs and sent through the model together? I.e., do we apply the activation function to the whole batch in each node?
I hope this makes sense. Please let me know if you need clarification or more information.
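For concreteness, here is a minimal sketch of how such univariate data is typically shaped for a Keras RNN; the window length and variable names are illustrative assumptions, not from the question:

import numpy as np

temps = np.random.randn(1000).astype("float32")  # stand-in for the 1000 temperatures

# Keras RNNs expect input of shape (samples, timesteps, features).
# Univariate data means features = 1; here each sample is a window of
# 10 consecutive temperatures (an assumed, illustrative window size).
timesteps = 10
windows = np.array([temps[i:i + timesteps] for i in range(len(temps) - timesteps)])
x = windows[..., np.newaxis]  # shape: (990, 10, 1)
print(x.shape)

# batch_size=100 means 100 of these windows go through the network in
# parallel before one weight update -- not 100 scalars fed one at a time.
# model.fit(x, y, batch_size=100)

In other words, the second interpretation is closer: within each window the RNN still steps through the timesteps sequentially, but the 100 windows in a batch are processed in parallel, and the weights are updated once per batch.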

Related

How does batch size work in TimeDistributed?

I'm a beginner in AI and am trying to implement a CRNN model in Keras.
model.add(TimeDistributed(base_model, input_shape=(3,32,32,3)))
I understand that the above code creates 3 timesteps and uses a 32x32 RGB image.
Then, if I have 90 training images and set the batch size to 30, how does it work?
Are the images grouped into sets of 30 and fed into the timesteps,
or
fed into the timesteps one by one in order,
or am I misunderstanding batch size?
If you have 90 images, 3 timesteps, and a batch size of 30, then each batch fed to the model has shape (30, 3, 32, 32, 3); the input_shape argument excludes the batch dimension, so (3, 32, 32, 3) as in your code is correct. Source: the docs https://keras.io/api/layers/recurrent_layers/time_distributed/
Batch size is the number of samples you use in one iteration of learning, before your CRNN updates its internal parameters.
As you can see in your screenshots, you've trained your model for one epoch in 3 steps (when training a learning model, an epoch is one pass through the entire dataset; 30 times 3 makes 90, the whole dataset).
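A minimal sketch of the shapes involved; the base model, LSTM size, and class count below are illustrative stand-ins, and only the TimeDistributed line comes from the question:

from tensorflow import keras

# stand-in for base_model: maps one 32x32 RGB image to a feature vector
base_model = keras.Sequential([
    keras.layers.Flatten(input_shape=(32, 32, 3)),
    keras.layers.Dense(64, activation="relu"),
])

model = keras.Sequential([
    # input_shape excludes the batch dimension: (timesteps, height, width, channels)
    keras.layers.TimeDistributed(base_model, input_shape=(3, 32, 32, 3)),
    keras.layers.LSTM(32),
    keras.layers.Dense(10, activation="softmax"),  # assumed number of classes
])

# With 90 samples and batch_size=30, each epoch runs 3 steps, and each
# batch tensor fed to the model has shape (30, 3, 32, 32, 3).
# model.fit(x_train, y_train, batch_size=30, epochs=1)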

Why does Keras accept a batch size option for model.evaluate?

Why does the evaluate function of the Keras API in TensorFlow accept a batch_size? To my knowledge, this parameter should only be relevant for managing how many samples we use per iteration during training. What influence does this choice have during model evaluation?
Batch size is mainly used in sequence-based predictions or in time series predictions.
Below are the cases where you have to use a batch size during prediction:
In time series use cases it may be desirable to use a large batch size when training the network and a batch size of 1 when making predictions, in order to predict the next step in the sequence.
For stateful RNNs, a fixed batch size is required during prediction/evaluation, because the output state of the current batch is used as the initial state for the next batch; they carry information from one batch to the next.
If your model doesn't fall into these categories, you technically don't need to provide a batch size when evaluating. Even if you do provide one, it only determines how much data is fed to the GPU at a time.
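A sketch of the stateful case; all shapes and sizes here are assumed for illustration:

from tensorflow import keras

# A stateful LSTM bakes a fixed batch size into the input specification,
# so hidden state can be carried from one batch to the next.
model = keras.Sequential([
    keras.layers.LSTM(4, stateful=True, batch_input_shape=(1, 10, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Evaluation/prediction must then use that same fixed batch size:
# model.evaluate(x_test, y_test, batch_size=1)
# model.predict(x_test, batch_size=1)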

Large trainable embedding layer slows down training

I am training a network to classify text with an LSTM. I use a randomly initialized, trainable embedding layer for the word inputs. The network is trained with the Adam optimizer, and the words are fed into the network as one-hot encodings.
I noticed that the number of words represented in the embedding layer heavily influences the training time, but I don't understand why. Increasing the number of words in the network from 200'000 to 2'000'000 almost doubled the time per training epoch.
Shouldn't training only update the weights that were used during the prediction of the current data point? Thus, if my input sequences always have the same length, the same number of updates should happen regardless of the size of the embedding layer.
The number of updates needed would be reflected in the number of epochs it takes to reach a certain precision.
If your observation is that convergence takes the same number of epochs but each epoch takes twice as much wall-clock time, then it's an indication that simply performing the embedding lookup (and writing the updates back to the embedding table) now takes a significant part of your training time.
This could easily be the case: 2'000'000 words times 4 bytes per float32 times the length of your embedding vector (what is it? let's assume 200) is something like 1.6 gigabytes of data that needs to be touched every minibatch. You also don't say how you're training this (CPU, GPU, which GPU), which has a meaningful impact because of e.g. cache effects: on a CPU, doing the exact same number of reads/writes in a slightly less cache-friendly manner (more sparsity) can easily double the execution time.
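The back-of-the-envelope number above, spelled out (the embedding length of 200 is the answer's assumption, not stated in the question):

vocab_size = 2_000_000
embedding_dim = 200        # assumed; the question doesn't state it
bytes_per_float32 = 4

table_bytes = vocab_size * embedding_dim * bytes_per_float32
print(table_bytes / 1e9)   # 1.6 (GB) -- with dense updates, touched every minibatch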
Also, your premise is a bit unusual. How much labeled data do you have, with enough examples of the 2'000'000th-rarest word, to calculate a meaningful embedding for it directly? It's probably possible but would be unusual; in pretty much all datasets, including very large ones, the 2'000'000th word would be a nonce, so it would be harmful to include it in trainable embeddings. The usual approach is to compute large embeddings separately from large unlabeled data and use them as a fixed, untrainable layer, possibly concatenated with small trainable embeddings learned from the labeled data to capture things like domain-specific terminology.
If I understand correctly, your network takes one-hot vectors representing words to embeddings of some size embedding_size. Then the embeddings are fed as input to an LSTM. The trainable variables of the network are both those of the embedding layer and the LSTM itself.
You are correct regarding the update of the weights in the embedding layer. However, the number of weights in one LSTM cell also depends on the size of the embedding. Looking, for example, at the equation for the forget gate of the t-th cell,

f_t = σ(W_f x_t + U_f h_{t-1} + b_f),

you can see that the weight matrix W_f is multiplied by the input x_t, meaning that one of the dimensions of W_f must be exactly embedding_size. So as embedding_size grows, so does the network size, and it takes longer to train.
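A quick way to see this dependence is to count the parameters of an LSTM layer as a function of the input (embedding) size; the sizes below are illustrative:

def lstm_param_count(input_dim, units):
    # Each of the 4 gates has W (units x input_dim), U (units x units),
    # and a bias (units), so the count grows linearly with input_dim.
    return 4 * (units * input_dim + units * units + units)

print(lstm_param_count(64, 128))   # 98816
print(lstm_param_count(256, 128))  # 197120 -- larger embedding, larger LSTM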

Convolutional Neural Network Training

I have a question regarding convolutional neural network (CNN) training.
I have managed to train a network using TensorFlow that takes an input image (1600 pixels) and outputs the one of three classes that matches it.
Testing the network with variations of the trained classes gives good results. However, when I give it a different, fourth image (one that does not contain any of the three trained classes), it always returns a random match to one of the classes.
My question is: how can I train the network to recognize that an image does not belong to any of the three trained classes? As a similar example, if I trained a network on the MNIST database and then gave it the character "A" or "B", is there a way for it to discriminate that the input does not belong to any of the classes?
Thank you
Your model will always predict one of the labels it was trained on; for example, if you train your model on MNIST data, its predictions will always be 0-9, just like the MNIST labels.
What you can do is first train a different model with 2 classes that predicts whether an image belongs to dataset A or B. E.g., for MNIST data, label all MNIST images as 1, add data from other sources that are different (not 0-9) labeled as 0, and then train a model to decide whether an image belongs to MNIST or not.
A convolutional neural network (CNN) predicts one of the defined classes after training; it always returns one of the classes regardless of confidence. I have faced a similar problem. What you can do is check the confidence value: if it is below some threshold, the input belongs to a "none" category. Hope this helps.
You probably have three output nodes and choose the maximum value (one-hot encoding). That's a bit unfortunate, as it's a low number of outputs: non-recognized inputs tend to cause pretty random outputs.
Now, with 3 outputs, roughly speaking you can get 7 outcomes. You might get a single high value (3 possibilities), but a non-recognized input can also cause 2 high outputs (also 3 possibilities) or all three outputs roughly equal (1 possibility). So there's a decent chance (~3/7) of a random input producing a pattern on the output nodes which you'd only expect for a recognized input.
Now, if you had 15 classes and thus 15 output nodes, you'd be looking at roughly 32767 possible outcomes for unrecognized inputs, only 15 of which correspond to expected one-hot outcomes.
Underlying this is a lack of training data. If your training set has examples outside the 3 classes, you can simply dump them into a 4th "other" category and train with that. This by itself isn't a reliable indicator, as the theoretical "other" set is usually huge, but you now have 2 complementary ways of detecting other inputs: either via the "other" output node or via one of the 11 ambiguous outputs.
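For concreteness, the counting behind these numbers, treating each output node as simply "high" or "low":

def outcome_counts(n_outputs):
    total = 2 ** n_outputs - 1     # every high/low pattern except all-low
    one_hot = n_outputs            # patterns with exactly one high node
    return total, total - one_hot  # (all patterns, ambiguous patterns)

print(outcome_counts(3))   # (7, 4)
print(outcome_counts(4))   # (15, 11) -> the 11 ambiguous outputs with an "other" node
print(outcome_counts(15))  # (32767, 32752)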
Another solution would be to check what output your CNN typically gives when shown something else. If the last layer is a softmax, your CNN returns probabilities for the three classes; if none of these probabilities is close to 1, that may be a sign the input is something else, assuming your CNN is well trained (it must be penalized for overconfidence when predicting wrong labels).
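A minimal sketch of that check; the probabilities and threshold below are illustrative stand-ins (probs would come from model.predict, and the threshold should be tuned on held-out data):

import numpy as np

# probs stands in for model.predict(x): softmax outputs of shape (n_samples, 3)
probs = np.array([[0.97, 0.02, 0.01],   # confident -> keep the class
                  [0.45, 0.40, 0.15]])  # ambiguous -> likely "something else"

confidence = probs.max(axis=1)
predicted = probs.argmax(axis=1)

threshold = 0.9                          # assumed; tune on a validation set
predicted[confidence < threshold] = -1   # -1 here means "none of the 3 classes"
print(predicted)                         # [ 0 -1]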

When training seq2seq model with bucketing method do we keep separate RNNs for each bucket?

Let's say we have 3 buckets of different lengths. Do we then train 3 different nets?
Couldn't we use a dynamic RNN instead, where it unrolls according to the length of the input sequence in the encoder, and the encoder then passes its last hidden state to the decoder? Would that work?
I looked into this. Bucketing helps speed up the training process: we first divide the examples into buckets, which reduces the number of units we have to pad.
In each training iteration we select one bucket and train the whole network on it.
During validation we check the perplexity of the test examples in each bucket.
TensorFlow supports this bucketing.
A dynamic RNN is different: there is no bucketing mechanism. We input the data as a tensor of shape [batch_size, max_seq_length, input_size].
Here, sequences shorter than the maximum length are padded with zeros.
TensorFlow then unrolls the RNN only up to the actual length of each input (ignoring the padded zeros); this is implemented with a while loop inside TensorFlow.
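A TF1-style sketch of this; the shapes and sizes are illustrative, and tf.nn.dynamic_rnn is the op that runs the internal while loop:

import tensorflow as tf

max_seq_length, input_size = 50, 128  # illustrative sizes

# zero-padded inputs: [batch_size, max_seq_length, input_size]
inputs = tf.placeholder(tf.float32, [None, max_seq_length, input_size])
# the true (unpadded) length of each sequence in the batch
seq_lens = tf.placeholder(tf.int32, [None])

cell = tf.nn.rnn_cell.LSTMCell(256)
# sequence_length makes the while loop stop at each sequence's real end,
# so the padded zeros are never processed
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=seq_lens, dtype=tf.float32)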