How does CNTK use sequence ids during training?

Sequence ids are optional in the CNTK text format. I am wondering how sequence ids are used during training. When a minibatch is created from a CNTKTextFormat file with sequence ids, is each line considered one sample, or are all lines with the same sequence id together considered one sample?

If IDs are given, then all lines with the same sequence ID together form one training instance (in CNTK lingo: they form a sequence consisting of samples).
If IDs are missing, then each line is a new training instance (a sequence consisting of a single sample).
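For illustration, a small sketch of what such a file might look like (the stream names |x and |y are invented for this example):
0 |x 0.1 0.2 |y 1
0 |x 0.3 0.4
1 |x 0.5 0.6 |y 0
Here the first two lines share sequence ID 0 and therefore form a single sequence of two samples, while the third line is a separate one-sample sequence.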

Related

Number of distinct labels and input data shape in tf.data Dataset

The Tensorflow Fashion-MNIST tutorial is great... but it seems clear you have to know in advance that there are 10 distinct labels in the dataset, and that the input data is image data of size 28x28. I would have thought these details should be readily discoverable from the dataset itself - is this possible? And could I discover the same information the same way on a quite different dataset (e.g. the Titanic dataset, which comprises M rows by N columns of CSV data and is a binary classification task)? tf.data.Dataset does not appear to have any obvious get_label_count() or get_input_shape() functions in its API. Call me a newbie, but this surprises/confuses me.
According to the accepted answer to this question, TensorFlow tf.data.Dataset instances are lazily evaluated, meaning that you could, in principle, need to iterate through an entire dataset to establish the number of distinct labels and the input data shape(s) (which can be variable, for example with variable-length sequences of sound or text).
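A minimal sketch of that one-pass discovery, assuming TF 2.x eager execution and the built-in Fashion-MNIST loader (the variable names are illustrative, not part of the tf.data API):
import tensorflow as tf

# Build a dataset from the Fashion-MNIST training arrays.
(images, labels), _ = tf.keras.datasets.fashion_mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices((images, labels))

# Discover the label set the hard way: one full pass over the data.
labels_seen = set()
for image, label in dataset:
    labels_seen.add(int(label.numpy()))

num_classes = len(labels_seen)               # 10 for Fashion-MNIST
input_shape = dataset.element_spec[0].shape  # static per-element shape; may contain None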

How to reuse one tensor when creating a tensorflow dataset iterator of a pair of tensors?

Imagine the case where I want to pair samples from one pool of data with samples from another pool to feed into the network, but many samples from the first pool should be paired with the same sample in the second pool (let's assume all samples have the same shape).
For example, if we denote the samples from the first pool as f_i, samples from the second pool as g_j, I might want to have a mini-batch of samples as below (each line is one sample in the mini-batch):
(f_0, g_0)
(f_1, g_0)
(f_2, g_0)
(f_3, g_0)
...
(f_10, g_0)
(f_11, g_1)
(f_12, g_1)
(f_13, g_1)
...
(f_19, g_1)
...
If the data from the second pool are small (like labels), then I can simply store them together with the samples from the first pool in tfrecords. But in my case the data from the second pool are the same size as the data from the first pool (for example, both are movie segments), so saving them as pairs in one tfrecords file would almost double the disk usage.
I wonder if there is any way I can save all the samples from the second pool only once on disk, but still feed the data to my network the way I described? (Assume I have already specified the mapping between samples in the first pool and those in the second pool based on their file names.)
Thanks a lot!
You can use an iterator for each of the tfrecords (or pools of samples), so you get two iterators that can each advance at their own pace. When you run get_next() on an iterator, the next sample is returned, so you have to keep it in a tensor and manually feed it. Quoting from the documentation:
(Note that, like other stateful objects in TensorFlow, calling Iterator.get_next() does not immediately advance the iterator. Instead you must use the returned tf.Tensor objects in a TensorFlow expression, and pass the result of that expression to tf.Session.run() to get the next elements and advance the iterator.)
So all you need is a couple of loops that iterate and combine samples from each iterator as a pair, and then you can feed this when you run your desired operation. For example:
g_iterator = g_dataset.make_one_shot_iterator()
get_next_g = g_iterator.get_next()
f_iterator = f_dataset.make_one_shot_iterator()
get_next_f = f_iterator.get_next()
# num_groups and f_per_g are placeholders for however many g samples you
# have and how many f samples should be paired with each of them.
for _ in range(num_groups):
    temp_g = session.run(get_next_g)      # advance g once per group
    for _ in range(f_per_g):
        temp_f = session.run(get_next_f)  # advance f for every pair
        session.run(train, feed_dict={f: temp_f, g: temp_g})
Does this answer your question?

Setting up the input on an RNN in Keras

So I had a specific question with setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model, with the rest being padded with 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28, i.e. 550k rows, each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k - sequence length) different arrays and feed all of those to the network?
Assuming that I want the first layer to be equivalent to the number of features per row, and to look at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50, 27)? And again, do I have to manually split the dataset up, or would Keras automatically do that for me?
RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array. This dimension is also called "batch size", number of examples, samples, etc.
Time steps are the number of steps in each sequence.
Elements per step is how much information you have in each step of a sequence.
I'm assuming the 27 features are inputs and correspond to ElementsPerStep, while the 1 target is the expected output, with 1 output per step.
So I'm also assuming that your output is a sequence, also with 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, you must reshape your array like this:
(1, 550000, 28)
#1 sequence
#550000 steps per sequence
#28 data elements per step
#PS: this sequence is too long; if it creates memory problems for you, it may be a good idea to use a `stateful=True` RNN, but I'm explaining the non-stateful method first.
Now you must split this array into inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27] #targets
Shaping the Keras layers:
Keras layers will ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000,27).
Since your desired result is a sequence of the same length, we will use return_sequences=True. (Else, you'd get only one result.)
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells)
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output (if you're using only recurrent layers, of course).
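A minimal sketch of such a stack, assuming the Keras Sequential API (the 32-cell hidden layer is an arbitrary illustrative choice):
from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
# Hidden recurrent layer returning the full sequence.
model.add(LSTM(32, input_shape=(550000, 27), return_sequences=True))
# Final layer with 1 cell so the output matches the 1 target per step.
model.add(LSTM(1, return_sequences=True))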
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train into shorter sequences*. The system will understand that every new batch is a continuation of the previous batches.
Then you will need to define batch_input_shape=(BatchSize, ReducedTimeSteps, Elements). In this case, unlike the other case, the batch size is not ignored.
* Unfortunately I have no experience with stateful=True. I'm not sure whether you must manually divide your array (less likely, I guess), or whether the system automatically divides it internally (more likely).
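For reference, such a stateful layer definition would look something like this (batchSize and reducedTimeSteps are placeholders you would have to choose):
# stateful=True requires a fixed batch size, hence batch_input_shape.
LSTM(numberOfCells, batch_input_shape=(batchSize, reducedTimeSteps, 27),
     stateful=True, return_sequences=True)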
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps:
import numpy as np

X = []
for i in range(550000 - 49):
    # each window: 50 consecutive steps of the 27 feature columns
    X.append(originalX[i:i+50, :27])
X = np.array(X)
# targets: the 28th column, excluding the first 49 steps
Y = originalX[49:, 27]

Tensorflow: getting outputs from bidirectional_rnn with variable sequence length

I'm using tf.nn.bidirectional_rnn with the sequence_length parameter for variable input size, and I can't figure out how to get the final output for each sample in the minibatch:
output, _, _ = tf.nn.bidirectional_rnn(forward1, backward1, input, dtype=tf.float32, sequence_length=input_lengths)
Now, if I had constant sequence lengths, I would simply use output[-1] to get the final output. In my case I have variable-length sequences (their lengths are known).
Also, is this output the output of both forward and backward LSTMs?
Thanks.
This question can be answered by looking at the source code of rnn.py.
For sequences with dynamic length, the source code says:
If the sequence_length vector is provided, dynamic calculation is performed. This method of calculation does not compute the RNN steps past the maximum sequence length of the minibatch (thus saving computational time), and properly propagates the state at an example's sequence length to the final state output.
Therefore, in order to get the actual last output, you should slice the resulting output.
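One common way to do that slice (a sketch, assuming input_lengths is an int32 tensor and that you first stack the length-T output list into a single [batch, time, depth] tensor; in older TensorFlow versions tf.stack was called tf.pack):
# Stack the per-step outputs: list of T [batch, depth] tensors -> [batch, T, depth].
stacked = tf.stack(output, axis=1)
batch_size = tf.shape(stacked)[0]
# For each example, gather the output at position (sequence length - 1).
idx = tf.stack([tf.range(batch_size), input_lengths - 1], axis=1)
last_outputs = tf.gather_nd(stacked, idx)  # [batch, depth]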
For bidirectional_rnn, the source code says:
A tuple (outputs, output_state_fw, output_state_bw) where:
outputs is a length T list of outputs (one for each input), which are depth-concatenated forward and backward outputs.
output_state_fw is the final state of the forward rnn.
output_state_bw is the final state of the backward rnn.
Therefore, the return value is a tuple rather than a single tensor, and outputs is a list with one depth-concatenated forward/backward tensor per time step. You can stack this list into a single tensor if you wish.

Variable length dimension in tensor

I'm trying to implement the paper "End-To-End Memory Networks" (http://arxiv.org/abs/1503.08895).
Each training example consists of a number of sentences, a question, and then the answer. The number of sentences is variable, as is the number of words in each sentence and in the question. Each word is encoded as an integer, so my input would have the form [batch size, # of sentences, # words in sentence].
Now my problem is that the second and third dimensions are unknown for each mini-batch. Can I still somehow represent this input as a single tensor, or do I have to use lists of tensors, so that I have a list of length batch_size, then a sublist whose length is the number of sentences, and then for each sentence a tensor, whose size is also not known in advance, corresponding to the words encoded as integers?
Can I use this second approach, or will tensorflow then not be able to backpropagate? E.g. I have an operation where I have to calculate the following sum: \sum_i tf.scalar_mul(p_i, c_i), where p_i is a scalar and c_i is an embedding vector that was previously calculated. The tensors for the p and c values are then stored in a list, so I would have to sum over the elements of the two lists in a loop. I'm assuming that tensorflow would not be able to incorporate this loop into the computation graph, correct? I'm sceptical since theano has a special scan function that allows one to loop over input, so I'm assuming that a regular loop would cause problems in the computation graph. How does tensorflow handle this?
Moving Yaroslav's comment to an answer:
TensorFlow has tf.scan. Dimensions may also be dynamic as in Theano.
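A small sketch of the questioner's sum \sum_i p_i c_i written with tf.scan, assuming p is an [N] vector of scalars and c is an [N, D] matrix of embeddings (D=64 is an arbitrary choice here):
import tensorflow as tf

p = tf.placeholder(tf.float32, [None])      # N scalar weights p_i
c = tf.placeholder(tf.float32, [None, 64])  # N embedding vectors c_i

# scan carries a running sum over i; the last accumulator value is the result.
partial_sums = tf.scan(lambda acc, pc: acc + pc[0] * pc[1],
                       (p, c),
                       initializer=tf.zeros_like(c[0]))
weighted_sum = partial_sums[-1]             # \sum_i p_i * c_i
For this particular sum you could also skip the loop entirely with tf.reduce_sum(tf.expand_dims(p, 1) * c, axis=0); tf.scan matters when each step depends on the previous one.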