In TensorFlow 1.12 there is the Dataset.zip function, documented here.
However, I was wondering if there is a dataset unzip function that will return the original two datasets.
# NOTE: The following examples use `{ ... }` to represent the
# contents of a dataset.
a = { 1, 2, 3 }
b = { 4, 5, 6 }
c = { (7, 8), (9, 10), (11, 12) }
d = { 13, 14 }
# The nested structure of the `datasets` argument determines the
# structure of elements in the resulting dataset.
Dataset.zip((a, b)) == { (1, 4), (2, 5), (3, 6) }
Dataset.zip((b, a)) == { (4, 1), (5, 2), (6, 3) }
# The `datasets` argument may contain an arbitrary number of
# datasets.
Dataset.zip((a, b, c)) == { (1, 4, (7, 8)),
(2, 5, (9, 10)),
(3, 6, (11, 12)) }
# The number of elements in the resulting dataset is the same as
# the size of the smallest dataset in `datasets`.
Dataset.zip((a, d)) == { (1, 13), (2, 14) }
I would like to have the following:
dataset = Dataset.zip((a, d)) == { (1, 13), (2, 14) }
a, d = dataset.unzip()
My workaround was to just use map; I'm not sure whether there might be interest in a syntactic-sugar function for unzip later, though.
a = dataset.map(lambda a, b: a)
b = dataset.map(lambda a, b: b)
TensorFlow's get_single_element() is finally available and can be used to unzip datasets (as asked in the question above).
This avoids the need to generate and drain an iterator via .map() or iter() (which could be costly for big datasets).
get_single_element() returns a tensor (or a tuple or dict of tensors) encapsulating all the members of the dataset. We need to pass all the members of the dataset, batched into a single element.
This can be used to get features as a tensor array, or features and labels as a tuple or dictionary (of tensor arrays), depending on how the original dataset was created.
import tensorflow as tf
a = [ 1, 2, 3 ]
b = [ 4, 5, 6 ]
c = [ (7, 8), (9, 10), (11, 12) ]
d = [ 13, 14 ]
# Creating datasets from lists
ads = tf.data.Dataset.from_tensor_slices(a)
bds = tf.data.Dataset.from_tensor_slices(b)
cds = tf.data.Dataset.from_tensor_slices(c)
dds = tf.data.Dataset.from_tensor_slices(d)
list(tf.data.Dataset.zip((ads, bds)).as_numpy_iterator()) == [ (1, 4), (2, 5), (3, 6) ] # True
list(tf.data.Dataset.zip((bds, ads)).as_numpy_iterator()) == [ (4, 1), (5, 2), (6, 3) ] # True
# Let's zip and unzip ads and dds
x = tf.data.Dataset.zip((ads, dds))
xa, xd = tf.data.Dataset.get_single_element(x.batch(len(x)))
xa = list(xa.numpy())
xd = list(xd.numpy())
print(xa, xd)  # [1, 2] [13, 14]; notice how xa now differs from a: ads was curtailed to the length of dds when the datasets were zipped above.
d == xd # True
Building on Ouwen Huang's answer, this function seems to work for arbitrary dict-structured datasets:
def split_datasets(dataset):
tensors = {}
names = list(dataset.element_spec.keys())
for name in names:
tensors[name] = dataset.map(lambda x: x[name])
return tensors
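For example, on a hypothetical dict-structured dataset (assuming TF 2.x eager execution; the feature names "x" and "y" are just for illustration):
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices({"x": [1, 2, 3], "y": [4, 5, 6]})
parts = split_datasets(ds)
print(list(parts["x"].as_numpy_iterator()))  # [1, 2, 3]
print(list(parts["y"].as_numpy_iterator()))  # [4, 5, 6]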
I have written a more general unzip function for tf.data.Dataset pipelines, which also handles the recursive case where a pipeline has multiple levels of zipping.
import tensorflow as tf
def tfdata_unzip(
tfdata: tf.data.Dataset,
*,
recursive: bool=False,
eager_numpy: bool=False,
num_parallel_calls: int=tf.data.AUTOTUNE,
):
"""
Unzip a zipped tf.data pipeline.
Args:
tfdata: the :py:class:`tf.data.Dataset`
to unzip.
recursive: Set to ``True`` to recursively unzip
multiple layers of zipped pipelines.
Defaults to ``False``.
eager_numpy: Set this to ``True`` to return
Python lists of primitive types or
:py:class:`numpy.array` objects. Defaults
to ``False``.
        num_parallel_calls: The level of parallelism to
            use each time we ``map()`` over a
            :py:class:`tf.data.Dataset`.
    Returns:
        A :py:class:`tf.data.Dataset`, or a (possibly
        nested) Python list of :py:class:`tf.data.Dataset`
        objects or NumPy arrays.
"""
if isinstance(tfdata.element_spec, tf.TensorSpec):
if eager_numpy:
return list(tfdata.as_numpy_iterator())
return tfdata
    def tfdata_map(i: int) -> tf.data.Dataset:
return tfdata.map(
lambda *cols: cols[i],
deterministic=True,
num_parallel_calls=num_parallel_calls,
)
if isinstance(tfdata.element_spec, tuple):
num_columns = len(tfdata.element_spec)
if recursive:
return [
tfdata_unzip(
tfdata_map(i),
recursive=recursive,
eager_numpy=eager_numpy,
num_parallel_calls=num_parallel_calls,
)
for i in range(num_columns)
]
else:
return [
tfdata_map(i)
for i in range(num_columns)
]
raise ValueError(
"Unknown tf.data.Dataset element_spec: " +
str(tfdata.element_spec)
)
Here is how tfdata_unzip() works, given these example datasets:
>>> import numpy as np
>>> baby = tf.data.Dataset.from_tensor_slices([
np.array([1,2]),
np.array([3,4]),
np.array([5,6]),
])
>>> baby.element_spec
TensorSpec(shape=(2,), dtype=tf.int64, name=None)
>>> parent = tf.data.Dataset.zip((baby, baby))
>>> parent.element_spec
(TensorSpec(shape=(2,), dtype=tf.int64, name=None),
TensorSpec(shape=(2,), dtype=tf.int64, name=None))
>>> grandparent = tf.data.Dataset.zip((parent, parent))
>>> grandparent.element_spec
((TensorSpec(shape=(2,), dtype=tf.int64, name=None),
TensorSpec(shape=(2,), dtype=tf.int64, name=None)),
(TensorSpec(shape=(2,), dtype=tf.int64, name=None),
TensorSpec(shape=(2,), dtype=tf.int64, name=None)))
This is what tfdata_unzip() returns on the above baby, parent, and grandparent datasets:
>>> tfdata_unzip(baby)
<TensorSliceDataset shapes: (2,), types: tf.int64>
>>> tfdata_unzip(parent)
[<ParallelMapDataset shapes: (2,), types: tf.int64>,
<ParallelMapDataset shapes: (2,), types: tf.int64>]
>>> tfdata_unzip(grandparent)
[<ParallelMapDataset shapes: ((2,), (2,)), types: (tf.int64, tf.int64)>,
<ParallelMapDataset shapes: ((2,), (2,)), types: (tf.int64, tf.int64)>]
>>> tfdata_unzip(grandparent, recursive=True)
[[<ParallelMapDataset shapes: (2,), types: tf.int64>,
<ParallelMapDataset shapes: (2,), types: tf.int64>],
[<ParallelMapDataset shapes: (2,), types: tf.int64>,
<ParallelMapDataset shapes: (2,), types: tf.int64>]]
>>> tfdata_unzip(grandparent, recursive=True, eager_numpy=True)
[[[array([1, 2]), array([3, 4]), array([5, 6])],
[array([1, 2]), array([3, 4]), array([5, 6])]],
[[array([1, 2]), array([3, 4]), array([5, 6])],
[array([1, 2]), array([3, 4]), array([5, 6])]]]
Related
Given 3 arrays as input to the network, it should learn what links the data in the 1st, 2nd, and 3rd arrays.
In particular:
the 1st array contains integers (e.g. 2, 3, 5, 6, 7)
the 2nd array contains integers (e.g. 3, 2, 4, 6, 2)
the 3rd array contains integers that are the result of an operation between the data in the 1st and 2nd arrays (e.g. 6, 6, 20, 36, 14).
As you can see from the example data above, the operation is a multiplication, so the network should learn this, giving:
model.predict(11,2) = 22.
Here's the code I've used:
import logging
import numpy as np
import tensorflow as tf
primo = np.array([2, 3, 5, 6, 7])
secondo = np.array([3, 2, 4, 6, 2])
risu = np.array([6, 6, 20, 36, 14])
l0 = tf.keras.layers.Dense(units=1, input_shape=[1])
model = tf.keras.Sequential([l0])
input1 = tf.keras.layers.Input(shape=(1, ), name="Pri")
input2 = tf.keras.layers.Input(shape=(1, ), name="Sec")
merged = tf.keras.layers.Concatenate(axis=1)([input1, input2])
dense1 = tf.keras.layers.Dense(
2,
input_dim=2,
activation=tf.keras.activations.sigmoid,
use_bias=True)(merged)
output = tf.keras.layers.Dense(
1,
activation=tf.keras.activations.relu,
use_bias=True)(dense1)
model = tf.keras.models.Model([input1, input2], output)
model.compile(
loss="mean_squared_error",
optimizer=tf.keras.optimizers.Adam(0.1))
model.fit([primo, secondo], risu, epochs=500, verbose = False, batch_size=16)
print(model.predict(11, 2))
My questions are:
is it correct to concatenate the 2 inputs as I did? I don't understand whether, by concatenating them this way, the network understands that input1 and input2 are two different pieces of data
I'm not able to get model.predict() to work; every attempt results in an error
Your model has two inputs, each with shape (None,1), so you need to use np.expand_dims:
print(model.predict([np.expand_dims(np.array(11), 0), np.expand_dims(np.array(2), 0)]))
Output:
[[20.316557]]
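Equivalently, you can pass arrays that already have the (batch, 1) shape the two Input layers expect:
print(model.predict([np.array([[11]]), np.array([[2]])]))  # each input has shape (1, 1): one sample, one feature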
The shape of my data after the mapping function should be (257, 1001, 1). I asserted this condition in the function, and the data passed without issue. But when extracting a vector from the dataset, the shape comes out as (1, 257, 1001, 1). tf.data never fails to be a bloody pain.
The code:
def read_npy_file(data):
# 'data' stores the file name of the numpy binary file storing the features of a particular sound file
# as a bytes string.
# decode() is called on the bytes string to decode it from a bytes string to a regular string
    # so that it can be passed as a parameter into np.load()
data = np.load(data.decode())
# Shape of data is now (1, rows, columns)
# Needs to be reshaped to (rows, columns, 1):
data = np.reshape(data, (257, 1001, 1))
assert data.shape == (257, 1001, 1), f"Shape of spectrogram is {data.shape}; should be (257, 1001, 1)."
return data.astype(np.float32)
spectrogram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, labels))
spectrogram_ds = spectrogram_ds.map(
lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32]), label]),
num_parallel_calls=tf.data.AUTOTUNE)
num_files = len(train_df)
num_train = int(0.8 * num_files)
num_val = int(0.1 * num_files)
num_test = int(0.1 * num_files)
spectrogram_ds = spectrogram_ds.shuffle(buffer_size=1000)
specgram_train_ds = spectrogram_ds.take(num_train)
specgram_test_ds = spectrogram_ds.skip(num_train)
specgram_val_ds = specgram_test_ds.take(num_val)
specgram_test_ds = specgram_test_ds.skip(num_val)
specgram, _ = next(iter(spectrogram_ds))
# The following assertion raises an error; not the one in the read_npy_file function.
assert specgram.shape == (257, 1001, 1), f"Spectrogram shape is {specgram.shape}. Should be (257, 1001, 1)"
I thought that the first dimension represented the batch size, which is of course 1 before batching. But after batching by calling batch(batch_size=64) on the dataset, the shape of a batch was (64, 1, 257, 1001, 1) when it should be (64, 257, 1001, 1).
I would appreciate any help.
Although I still can't explain why I'm getting that output, I did find a workaround: I simply reshaped the data in another mapping, like so:
def read_npy_file(data):
# 'data' stores the file name of the numpy binary file storing the features of a particular sound file
# as a bytes string.
# decode() is called on the bytes string to decode it from a bytes string to a regular string
    # so that it can be passed as a parameter into np.load()
data = np.load(data.decode())
# Shape of data is now (1, rows, columns)
# Needs to be reshaped to (rows, columns, 1):
data = np.reshape(data, (257, 1001, 1))
assert data.shape == (257, 1001, 1), f"Shape of spectrogram is {data.shape}; should be (257, 1001, 1)."
return data.astype(np.float32)
specgram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, one_hot_encoded_labels))
specgram_ds = specgram_ds.map(
lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32, ]), label]),
num_parallel_calls=tf.data.AUTOTUNE)
specgram_ds = specgram_ds.map(lambda specgram, label: tuple([tf.reshape(specgram, (257, 1001, 1)), label]),
num_parallel_calls=tf.data.AUTOTUNE)
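For completeness, a possible single-map alternative (a sketch, assuming the extra dimension comes from the list-valued Tout of tf.numpy_function) is to index the single output and restore the static shape with set_shape:
def load_specgram(file, label):
    # Tout given as a list makes tf.numpy_function return a list of tensors;
    # take element [0] to keep the spectrogram as a single tensor.
    specgram = tf.numpy_function(read_npy_file, [file], [tf.float32])[0]
    # Shape information is lost across numpy_function, so restore it explicitly.
    specgram.set_shape((257, 1001, 1))
    return specgram, label

specgram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, one_hot_encoded_labels))
specgram_ds = specgram_ds.map(load_specgram, num_parallel_calls=tf.data.AUTOTUNE)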
Imagine that I want to train a model which minimizes the distance between an image and a query. On one side I have image features from a CNN; on the other side I have mappings from words to embedding vectors (w2v, for example):
def raw_data_generator():
for row in network_data:
yield (row["cnn"], row["w2v_indices"])
dataset = tf.data.Dataset.from_generator(raw_data_generator, (tf.float32, tf.int32))
dataset = dataset.prefetch(1000)
Here I want to create batches, but I want a dense batch for the CNN features and a sparse batch for the w2v indices, since the latter obviously has variable length (and I want to use safe_embedding_lookup_sparse). There is a batch() function for the dense case and .apply(tf.contrib.data.dense_to_sparse_batch(..)) for the sparse case, but how can I use them simultaneously?
You could try creating two datasets (one for each feature), applying the appropriate batching to each, and then zipping them together with tf.data.Dataset.zip; a sketch follows the documentation excerpt below.
@staticmethod
zip(datasets)
Creates a Dataset by zipping together the given datasets.
This method has similar semantics to the built-in zip() function in
Python, with the main difference being that the datasets argument can
be an arbitrary nested structure of Dataset objects. For example:
# NOTE: The following examples use `{ ... }` to represent the
# contents of a dataset.
a = { 1, 2, 3 }
b = { 4, 5, 6 }
c = { (7, 8), (9, 10), (11, 12) }
d = { 13, 14 }
# The nested structure of the `datasets` argument determines the
# structure of elements in the resulting dataset.
Dataset.zip((a, b)) == { (1, 4), (2, 5), (3, 6) }
Dataset.zip((b, a)) == { (4, 1), (5, 2), (6, 3) }
# The `datasets` argument may contain an arbitrary number of
# datasets.
Dataset.zip((a, b, c)) == { (1, 4, (7, 8)),
(2, 5, (9, 10)),
(3, 6, (11, 12)) }
# The number of elements in the resulting dataset is the same as
# the size of the smallest dataset in `datasets`.
Dataset.zip((a, d)) == { (1, 13), (2, 14) }
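A minimal sketch of that two-pipeline approach (the network_data rows, batch size, and max_len below are hypothetical stand-ins; dense_to_sparse_batch lives in tf.contrib.data on TF 1.x and in tf.data.experimental on later releases):
import tensorflow as tf

network_data = [  # hypothetical rows, mirroring the question's generator
    {"cnn": [0.1, 0.2, 0.3], "w2v_indices": [4, 7]},
    {"cnn": [0.4, 0.5, 0.6], "w2v_indices": [1]},
]
max_len = 10  # hypothetical upper bound on the number of w2v indices per row

def cnn_gen():
    for row in network_data:
        yield row["cnn"]

def w2v_gen():
    for row in network_data:  # must iterate in the same order as cnn_gen
        yield row["w2v_indices"]

cnn_ds = tf.data.Dataset.from_generator(cnn_gen, tf.float32)
w2v_ds = tf.data.Dataset.from_generator(w2v_gen, tf.int32)

batch_size = 2
dense_batches = cnn_ds.batch(batch_size)
sparse_batches = w2v_ds.apply(
    tf.data.experimental.dense_to_sparse_batch(batch_size, row_shape=[max_len]))

# Each element is now a (dense CNN batch, sparse w2v batch) pair.
dataset = tf.data.Dataset.zip((dense_batches, sparse_batches)).prefetch(1000)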
I have two multi-dimensional tensors, a and b, and I want to sort them by the values of a.
I found that tf.nn.top_k can sort a tensor and return the indices used to sort the input. How can I use the indices returned from tf.nn.top_k(a, k=2) to sort b?
For example,
import tensorflow as tf
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
# How to sort b into
# sorted_b[0, 0, 0, :] = b[0, 0, indices[0, 0, 0], :]
# sorted_b[0, 0, 1, :] = b[0, 0, indices[0, 0, 1], :]
# sorted_b[0, 1, 0, :] = b[0, 1, indices[0, 1, 0], :]
# ...
Update
Combining tf.gather_nd with tf.meshgrid is one solution. For example, the following code was tested on Python 3.5 with TensorFlow 1.0.0-rc0:
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
shape_a = tf.shape(a)
auxiliary_indices = tf.meshgrid(*[tf.range(d) for d in (tf.unstack(shape_a[:(a.get_shape().ndims - 1)]) + [k])], indexing='ij')
sorted_b = tf.gather_nd(b, tf.stack(auxiliary_indices[:-1] + [indices], axis=-1))
However, I wonder if there is a more readable solution that doesn't require creating the auxiliary_indices above.
Your code has a problem:
b = tf.reshape(tf.range(60), (2, 5, 3, 7))
TensorFlow cannot reshape a tensor with 60 elements to shape [2, 5, 3, 7] (210 elements).
Also, you can't sort a rank-4 tensor (b) using indices from a rank-3 tensor.
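As an aside, on newer TensorFlow versions (1.14+/2.x, so not the 1.0.0-rc0 used above) tf.gather with batch_dims gives a more readable equivalent of the meshgrid construction:
import tensorflow as tf

a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2

sorted_a, indices = tf.nn.top_k(a, k)  # indices has shape (2, 5, 2)
# Gather along axis 2 of b, treating the first two axes as batch dimensions,
# so sorted_b[i, j, n, :] == b[i, j, indices[i, j, n], :].
sorted_b = tf.gather(b, indices, axis=2, batch_dims=2)  # shape (2, 5, 2, 7)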
Let's say that I have an arbitrary number of iterables, all of which can be assumed to be sorted and to contain elements of the same type (integers, for illustration's sake).
a = (1, 2, 3, 4, 5)
b = (2, 4, 5)
c = (1, 2, 3, 5)
I would like to write a generator function yielding the following:
(1, None, 1)
(2, 2, 2)
(3, None, 3)
(4, 4, None)
(5, 5, 5)
In other words, progressively yield sorted tuples with gaps where elements are missing from the input iterables.
My take on this, using only iterators, not heaps:
a = (1, 2, 4, 5)
b = (2, 5)
c = (1, 2, 6)
d = (1,)
inputs = [iter(x) for x in (a, b, c, d)]
def minwithreplacement(currents, inputs, minitem, done):
    for i in range(len(currents)):
        if currents[i] == minitem:
            try:
                currents[i] = next(inputs[i])
            except StopIteration:
                currents[i] = None
                done[0] += 1
            yield minitem
        else:
            yield None

def dothing(inputs):
    currents = [next(it) for it in inputs]
    done = [0]
    while done[0] != len(currents):
        yield minwithreplacement(currents, inputs, min(x for x in currents if x is not None), done)

print([list(x) for x in dothing(inputs)])  # consuming the inner generators for display purposes
>>> [[1, None, 1, 1], [2, 2, 2, None], [4, None, None, None], [5, 5, None, None], [None, None, 6, None]]
We first need a variation of heapq.merge which also yields the index. You can get that by copy-pasting heapq.merge, and replacing each yield v with yield itnum, v. (I omit that part from my answer for readability).
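One way to get such an indexed merge without copying the stdlib source is to tag each value with its source index and let heapq.merge order the tagged streams (a sketch of the omitted helper, functionally equivalent rather than the copy-paste described above):
import heapq

def merge(*iterables):
    # Yield (source_index, value) pairs in globally sorted order of value.
    def tagged(index, iterable):
        for value in iterable:
            yield value, index  # tuples compare by value first, then by source index
    streams = [tagged(i, it) for i, it in enumerate(iterables)]
    for value, index in heapq.merge(*streams):
        yield index, value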
Now we can do:
from collections import OrderedDict

def f(*iterables):
    pending = OrderedDict()
    for i, v in merge(*iterables):
        if (not pending) or next(reversed(pending)) < v:
            # a new greatest value
            pending[v] = [None] * len(iterables)
        pending[v][i] = v
        # yield all values smaller than v
        while len(pending) > 1 and next(iter(pending)) < v:
            yield pending.pop(next(iter(pending)))
    # yield remaining
    while pending:
        yield pending.pop(next(iter(pending)))

print(list(f((1, 2, 3, 4, 5), (2, 4, 5), (1, 2, 3, 5))))
=> [[1, None, 1], [2, 2, 2], [3, None, 3], [4, 4, None], [5, 5, 5]]