Chunk tensorflow dataset records into multiple records - tensorflow

I have an unbatched tensorflow dataset that looks like this:
ds = ...
for record in ds.take(3):
    print('data shape={}'.format(record['data'].shape))
-> data shape=(512, 512, 87)
-> data shape=(512, 512, 277)
-> data shape=(512, 512, 133)
I want to feed the data to my network in chunks of depth 5. In the example above, the tensor of shape (512, 512, 87) would be divided into 17 tensors of shape (512, 512, 5). The final 2 slices along the last dimension (tensor[:, :, 85:87]) should be discarded.
For example:
chunked_ds = ...
for record in chunked_ds.take(1):
    print('chunked data shape={}'.format(record['data'].shape))
-> chunked data shape=(512, 512, 5)
How can I get from ds to chunked_ds? tf.data.Dataset.window() looks like what I need, but I cannot get it working.

This can be actually done using tf.data.Dataset-only operations:
data = tf.random.normal(shape=[10, 512, 512, 87])
ds = tf.data.Dataset.from_tensor_slices(data)
chunk_size = 5
chunked_ds = ds.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(tf.transpose(x, perm=[2, 0, 1])).batch(chunk_size, drop_remainder=True)) \
    .map(lambda rec: tf.transpose(rec, perm=[1, 2, 0]))
What is going on there:
First, we treat each record as a separate Dataset and permute it so that the last dimension becomes the batch dimension (flat_map then flattens the internal datasets back into a stream of Tensors):
.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(tf.transpose(x, perm=[2, 0, 1])
Then we batch it in groups of 5, discarding any remainder:
.batch(chunk_size, drop_remainder=True))
Finally, we re-permute the tensors so that the 512x512 dimensions come first:
.map(lambda rec: tf.transpose(rec, perm=[1, 2, 0]))
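A quick sanity check (a sketch, assuming the dummy data above with 10 records of depth 87): each chunk should come out as (512, 512, 5), and 10 records yield 10 * (87 // 5) = 170 chunks in total.
for rec in chunked_ds.take(2):
    print(rec.shape)  # (512, 512, 5)
# Count the chunks by folding over the dataset.
print(chunked_ds.reduce(0, lambda count, _: count + 1).numpy())  # 170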

In order to express my solution, I'll first create a dummy dataset with 10 samples, each of shape [512, 512, 87]:
data = tf.random.normal(shape=[10, 512, 512, 87])
ds = tf.data.Dataset.from_tensor_slices(data)
On executing the code below,
for record in ds.take(3):
    print(record.shape)
We get the output,
(512, 512, 87)
(512, 512, 87)
(512, 512, 87)
For convenience, I have created a dataset in which the length of the last dimension is constant, i.e. 87 (unlike your data, where it varies per record). The solution below is nevertheless independent of the length of the last dimension.
The solution:
# chunk/window size
chunk_depth = 5
# list to store the chunks
chunks = []
# Iterate through each sample in ds (Note: ds.as_numpy_iterator() yields NumPy arrays)
for sample in ds.as_numpy_iterator():
    # Length of the last dimension
    feature_size = sample.shape[2]
    # Number of full chunks that can be produced
    num_chunks = feature_size // chunk_depth
    # Slice along the last dimension, storing the "chunks" in the chunks list.
    for i in range(0, num_chunks * chunk_depth, chunk_depth):
        chunk = sample[:, :, i:i + chunk_depth]
        chunks.append(chunk)
# Convert list -> tf.data.Dataset
chunked_ds = tf.data.Dataset.from_tensor_slices(chunks)
The output of the below code,
for sample in chunked_ds.take(1):
    print(sample.shape)
is as expected in the question,
(512, 512, 5)
The solution is available as a Colab notebook.
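As a quick check (a sketch, assuming TF 2.3+ where Dataset.cardinality() is available): since the chunks were materialized in a Python list, the whole chunked dataset is held in memory and its length is known up front.
print(chunked_ds.cardinality().numpy())  # 10 * (87 // 5) = 170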

Related

Explain the outputs of Bidirectional LSTM. The output length is 5; which of them are the hidden states and cell states, respectively?

I am trying to use a bidirectional LSTM as an encoder. I set return_sequences=True and return_state=True and I am getting an output list with a length of 5.
# define model
inputs1 = Input(shape=(3, 1))
lstm1 = Bidirectional(merge_mode = 'ave', layer = LSTM(3, return_sequences=True, return_state = True))(inputs1)
model = Model(inputs=inputs1, outputs=lstm1)
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
pred = model.predict(data)
The length of the output of the Bidirectional LSTM is 5:
len(pred) # 5
The shapes of all outputs from the Bidirectional LSTM are:
for num, i in enumerate(pred):
    print(num, ': ', i.shape)
output
0 : (1, 3, 3)
1 : (1, 3)
2 : (1, 3)
3 : (1, 3)
4 : (1, 3)
Since it is bidirectional, I am assuming two of them are hidden states and two of them are cell states. Could you tell me the order? Thank you.
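For reference, a minimal sketch of the ordering Keras typically uses for Bidirectional(LSTM(..., return_state=True)) (assuming a standard Keras/TF version): the merged output sequence comes first, followed by the forward layer's states and then the backward layer's states, with each state pair being hidden state h, then cell state c.
# pred[0]: merged output sequence (merge_mode='ave'), shape (1, 3, 3)
# pred[1], pred[2]: forward LSTM hidden state h and cell state c, each (1, 3)
# pred[3], pred[4]: backward LSTM hidden state h and cell state c, each (1, 3)
merged_seq, fwd_h, fwd_c, bwd_h, bwd_c = pred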

Keras Sequential with multiple inputs

Given 3 arrays as input, the network should learn what links the data in the 1st, 2nd, and 3rd arrays.
In particular:
1st array contains integer numbers (eg.: 2, 3, 5, 6, 7)
2nd array contains integer numbers (eg.: 3, 2, 4, 6, 2)
3rd array contains integer numbers that are the results of an operation done between data in 1st and 2nd array (eg.: 6, 6, 20, 36, 14).
As you can see from the example data above, the operation is multiplication, so the network should learn this, giving:
model.predict(11, 2) = 22.
Here's the code I've used:
import logging
import numpy as np
import tensorflow as tf
primo = np.array([2, 3, 5, 6, 7])
secondo = np.array([3, 2, 4, 6, 2])
risu = np.array([6, 6, 20, 36, 14])
l0 = tf.keras.layers.Dense(units=1, input_shape=[1])
model = tf.keras.Sequential([l0])
input1 = tf.keras.layers.Input(shape=(1, ), name="Pri")
input2 = tf.keras.layers.Input(shape=(1, ), name="Sec")
merged = tf.keras.layers.Concatenate(axis=1)([input1, input2])
dense1 = tf.keras.layers.Dense(
    2,
    input_dim=2,
    activation=tf.keras.activations.sigmoid,
    use_bias=True)(merged)
output = tf.keras.layers.Dense(
    1,
    activation=tf.keras.activations.relu,
    use_bias=True)(dense1)
model = tf.keras.models.Model([input1, input2], output)
model.compile(
    loss="mean_squared_error",
    optimizer=tf.keras.optimizers.Adam(0.1))
model.fit([primo, secondo], risu, epochs=500, verbose=False, batch_size=16)
print(model.predict(11, 2))
My questions are:
Is it correct to concatenate the 2 inputs as I did? I don't understand whether, concatenated this way, the network understands that input1 and input2 are 2 different inputs.
I'm not able to make model.predict() work; every attempt results in an error.
Your model has two inputs, each with shape (None,1), so you need to use np.expand_dims:
print(model.predict([np.expand_dims(np.array(11), 0), np.expand_dims(np.array(2), 0)]))
Output:
[[20.316557]]
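Equivalently (a small sketch, not from the original answer), you can build the inputs with shape (1, 1) up front, since each of the two inputs expects a (batch_size, 1) array:
import numpy as np
# One sample per input, each with a single feature.
print(model.predict([np.array([[11.0]]), np.array([[2.0]])]))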

Shape of data changing in Tensorflow dataset

The shape of my data after the mapping function should be (257, 1001, 1). I asserted this condition in the function and the data passed without an issue. But when extracting a vector from the dataset, the shape comes out as (1, 257, 1001, 1). Tfds never fails to be a bloody pain.
The code:
def read_npy_file(data):
    # 'data' stores the file name of the NumPy binary file storing the features of a particular
    # sound file as a bytes string.
    # decode() is called on the bytes string to decode it from a bytes string to a regular string
    # so that it can be passed as a parameter into np.load().
    data = np.load(data.decode())
    # Shape of data is now (1, rows, columns)
    # Needs to be reshaped to (rows, columns, 1):
    data = np.reshape(data, (257, 1001, 1))
    assert data.shape == (257, 1001, 1), f"Shape of spectrogram is {data.shape}; should be (257, 1001, 1)."
    return data.astype(np.float32)
spectrogram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, labels))
spectrogram_ds = spectrogram_ds.map(
    lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32]), label]),
    num_parallel_calls=tf.data.AUTOTUNE)
num_files = len(train_df)
num_train = int(0.8 * num_files)
num_val = int(0.1 * num_files)
num_test = int(0.1 * num_files)
spectrogram_ds = spectrogram_ds.shuffle(buffer_size=1000)
specgram_train_ds = spectrogram_ds.take(num_train)
specgram_test_ds = spectrogram_ds.skip(num_train)
specgram_val_ds = specgram_test_ds.take(num_val)
specgram_test_ds = specgram_test_ds.skip(num_val)
specgram, _ = next(iter(spectrogram_ds))
# The following assertion raises an error; not the one in the read_npy_file function.
assert specgram.shape == (257, 1001, 1), f"Spectrogram shape is {specgram.shape}. Should be (257, 1001, 1)"
I thought that the first dimension represented the batch size, which is 1, of course, before batching. But after batching by calling batch(batch_size=64) on the dataset, the shape of a batch was (64, 1, 257, 1001, 1) when it should be (64, 257, 1001, 1).
Would appreciate any help.
Although I still can't explain why I'm getting that output, I did find a workaround. I simply reshaped the data in another mapping like so:
def read_npy_file(data):
    # 'data' stores the file name of the NumPy binary file storing the features of a particular
    # sound file as a bytes string.
    # decode() is called on the bytes string to decode it from a bytes string to a regular string
    # so that it can be passed as a parameter into np.load().
    data = np.load(data.decode())
    # Shape of data is now (1, rows, columns)
    # Needs to be reshaped to (rows, columns, 1):
    data = np.reshape(data, (257, 1001, 1))
    assert data.shape == (257, 1001, 1), f"Shape of spectrogram is {data.shape}; should be (257, 1001, 1)."
    return data.astype(np.float32)
specgram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, one_hot_encoded_labels))
specgram_ds = specgram_ds.map(
    lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32, ]), label]),
    num_parallel_calls=tf.data.AUTOTUNE)
specgram_ds = specgram_ds.map(
    lambda specgram, label: tuple([tf.reshape(specgram, (257, 1001, 1)), label]),
    num_parallel_calls=tf.data.AUTOTUNE)
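An alternative to the extra reshape (a sketch, assuming the stray dimension comes from passing Tout as a list): give tf.numpy_function a single dtype so it returns one tensor instead of a list of one, and pin down the static shape with tf.ensure_shape:
def load_specgram(file, label):
    # Single dtype (not [tf.float32]) -> tf.numpy_function returns a single tensor.
    specgram = tf.numpy_function(read_npy_file, [file], tf.float32)
    # Restore the static shape, which is unknown after numpy_function.
    specgram = tf.ensure_shape(specgram, (257, 1001, 1))
    return specgram, label

specgram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, one_hot_encoded_labels))
specgram_ds = specgram_ds.map(load_specgram, num_parallel_calls=tf.data.AUTOTUNE)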

Sketch_RNN, ValueError: Cannot feed value of shape

I get the following error:
ValueError: Cannot feed value of shape (1, 251, 5) for Tensor u'vector_rnn_1/Placeholder_1:0', which has shape '(1, 117, 5)'
when running code from here
https://github.com/tensorflow/magenta-demos/blob/master/jupyter-notebooks/Sketch_RNN.ipynb
The error occurs in this method:
def encode(input_strokes):
    strokes = to_big_strokes(input_strokes).tolist()
    strokes.insert(0, [0, 0, 1, 0, 0])
    seq_len = [len(input_strokes)]
    draw_strokes(to_normal_strokes(np.array(strokes)))
    return sess.run(eval_model.batch_z,
                    feed_dict={eval_model.input_data: [strokes],
                               eval_model.sequence_lengths: seq_len})[0]
I have to mention I trained my own model following the instructions here:
https://github.com/tensorflow/magenta/tree/master/magenta/models/sketch_rnn
Can someone help me understand and solve this issue?
Thanks
Regards
In my case, the problem was caused by the to_big_strokes() function. If you do not modify to_big_strokes() in sketch_rnn/utils.py, it will by default pad the input_strokes sequence to a length of 250.
All you need to do is modify the max_len parameter in that function: change its value to the maximum sequence length of your own dataset (21 for me), as shown on the line marked "change" below.
def to_big_strokes(stroke, max_len=21):  # change: 250 -> 21
    """Converts from stroke-3 to stroke-5 format and pads to given length."""
    # (But does not insert special start token).
    result = np.zeros((max_len, 5), dtype=float)
    l = len(stroke)
    assert l <= max_len
    result[0:l, 0:2] = stroke[:, 0:2]
    result[0:l, 3] = stroke[:, 2]
    result[0:l, 2] = 1 - result[0:l, 3]
    result[l:, 4] = 1
    return result
The problem was that the strokes size was not equal to the array size expected by the algorithm, so adapting the strokes array fixed the issue.
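Alternatively (a sketch based on the signature shown above, not on the original notebook), you can leave sketch_rnn/utils.py untouched and pass max_len where encode() calls to_big_strokes():
def encode(input_strokes, max_len=21):  # set this to your dataset's maximum sequence length
    strokes = to_big_strokes(input_strokes, max_len=max_len).tolist()
    strokes.insert(0, [0, 0, 1, 0, 0])
    seq_len = [len(input_strokes)]
    draw_strokes(to_normal_strokes(np.array(strokes)))
    return sess.run(eval_model.batch_z,
                    feed_dict={eval_model.input_data: [strokes],
                               eval_model.sequence_lengths: seq_len})[0]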

Double Batching Tensorflow Input Data

I am implementing a convnet for token classification of string data. I need to read string data from a TFRecord, shuffle and batch it, then perform some processing that expands the data, and batch that again. Is this possible with two shuffle_batch operations?
This is what I need to do:
enqueue filenames into a filequeue
for each serialized Example, put onto a shuffle_batch
When I pull each example off the shuffle batch, I need to PAD it, replicate it by the sequence length, and concatenate a position vector; this creates multiple examples for each original example from the first batch. I need to batch these again.
Of course, one solution is to just preprocess the data before loading it into TF, but that would take up way more disk space than necessary.
DATA
Here is some sample data. I have two "Examples". Each Example contains the features of a tokenized sentence and the labels for each token:
sentences = [
    ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.'],
    ['then', 'the', 'lazy', 'dog', 'slept', '.']
]
sent_labels = [
    ['O', 'O', 'O', 'ANIMAL', 'O', 'O', 'O', 'O', 'ANIMAL', 'O'],
    ['O', 'O', 'O', 'ANIMAL', 'O', 'O']
]
Each "Example" Now has features as below (some reducution for clarity):
features {
  feature {
    key: "labels"
    value {
      bytes_list {
        value: "O"
        value: "O"
        value: "O"
        value: "ANIMAL"
        ...
      }
    }
  }
  feature {
    key: "sentence"
    value {
      bytes_list {
        value: "the"
        value: "quick"
        value: "brown"
        value: "fox"
        ...
      }
    }
  }
}
Transformation
After batching the sparse data, I receive a sentence as a list of tokens:
['the', 'quick', 'brown', 'fox', ...]
I need to PAD the list first to a predetermined SEQ_LEN, and then insert position indices into each example, rotating the positions such that the token I want to classify is at position 0 and every other position index is relative to position 0:
[
    ['the', 0,  'quick', 1,  'brown', 2,  'fox', 3, 'PAD', 4],  # classify 'the'
    ['the', -1, 'quick', 0,  'brown', 1,  'fox', 2, 'PAD', 3],  # classify 'quick'
    ['the', -2, 'quick', -1, 'brown', 0,  'fox', 1, 'PAD', 2],  # classify 'brown'
    ['the', -3, 'quick', -2, 'brown', -1, 'fox', 0, 'PAD', 1],  # classify 'fox'
]
Batching and ReBatching The Data
Here is a simplified version of what I'm trying to do:
# Enqueue the Filenames and serialize
filenames =[outfilepath]
fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True, name='FQ')
reader = tf.TFRecordReader()
key, serialized_example = reader.read(fq)
# Dequeue Examples of batch_size == 1. Because all examples are Sparse Tensors, do 1 at a time
initial_batch = tf.train.shuffle_batch([serialized_example], batch_size=1, capacity=capacity, min_after_dequeue=min_after_dequeue)
# Parse Sparse Tensors, make into single dense Tensor
# ['the', 'quick', 'brown', 'fox']
parsed = tf.parse_example(initial_batch, features=feature_mapping)
dense_tensor_sentence = tf.sparse_tensor_to_dense(parsed['sentence'], default_value='<PAD>')
sent_len = tf.shape(dense_tensor_sentence)[1]
SEQ_LEN = 5
NUM_PADS = SEQ_LEN - sent_len
#['the', 'quick', 'brown', 'fox', 'PAD']
padded_sentence = pad(dense_tensor_sentence, NUM_PADS)
# make sent_len X SEQ_LEN copy of sentence, position vectors
# [
#   ['the', 0,  'quick', 1,  'brown', 2,  'fox', 3, 'PAD', 4],
#   ['the', -1, 'quick', 0,  'brown', 1,  'fox', 2, 'PAD', 3],
#   ['the', -2, 'quick', -1, 'brown', 0,  'fox', 1, 'PAD', 2],
#   ['the', -3, 'quick', -2, 'brown', -1, 'fox', 0, 'PAD', 1],
# ]
# NOTE: There is no row where PAD has position 0, because I don't
# want to classify the PAD token.
examples_with_positions = replicate_and_insert_positions(padded_sentence)
# While my SEQ_LEN will be constant, the sent_len will not. Therefore,
# I don't know the number of rows, but I can guarantee the number of
# columns. shape = (?, SEQ_LEN)
dynamic_input = final_reshape(examples_with_positions) # shape = (?, SEQ_LEN)
# Try Random Shuffle Queue:
# Rebatch <-- This is where the problem is
#reshape_concat.set_shape((None, SEQ_LEN))
random_queue = tf.RandomShuffleQueue(10000, 50, [tf.int64], shapes=(SEQ_LEN,))
random_queue.enqueue_many(dynamic_input)
batch = random_queue.dequeue_many(4)
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables())
sess = create_session()
sess.run(init_op)
#tf.get_default_graph().finalize()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
try:
    i = 0
    while True:
        print(sess.run(batch))
        i += 1
except tf.errors.OutOfRangeError as e:
    print("No more inputs.")
EDIT
I'm now trying to use the RandomShuffleQueue. On each enqueue, I would like to enqueue a batch with shape (None, SEQ_LEN). I've modified the code above to reflect this.
I no longer get complaints about the input shapes, but the queuing hangs at sess.run(batch).
I was approaching the entire problem incorrectly. I mistakenly thought that I had to define the complete shape of the batch when inserting into tf.train.shuffle_batch, but I actually only needed to define the shape of each element I was inputting, and set enqueue_many=True.
Here is the correct code:
single_batch=1
input_batch_size = 64
min_after_dequeue = 10
capacity = min_after_dequeue + 3 * input_batch_size
num_epochs=2
SEQ_LEN = 10
filenames =[outfilepath]
fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True)
reader = tf.TFRecordReader()
key, serialized_example = reader.read(fq)
# Dequeue examples of batch_size == 1. Because all examples are Sparse Tensors, do 1 at a time
first_batch = tf.train.shuffle_batch([serialized_example], single_batch, capacity, min_after_dequeue)
# Get a single sentence and preprocess it shape=(sent_len)
single_sentence = tf.parse_example(first_batch, features=feature_mapping)
# Preprocess Sentence. shape=(sent_len, SEQ_LEN * 2). Each row is example
processed_inputs = preprocess(single_sentence)
# Re batch
input_batch = tf.train.shuffle_batch([processed_inputs],
                                     batch_size=input_batch_size,
                                     capacity=capacity, min_after_dequeue=min_after_dequeue,
                                     shapes=[SEQ_LEN * 2], enqueue_many=True)  # <- This is the fix
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables())
sess = create_session()
sess.run(init_op)
#tf.get_default_graph().finalize()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
try:
    i = 0
    while True:
        print(i)
        print(sess.run(input_batch))
        i += 1
except tf.errors.OutOfRangeError as e:
    print("No more inputs.")