Interleaving multiple TensorFlow datasets together - tensorflow

The current TensorFlow dataset interleave functionality is essentially an interleaved flat map that takes a single dataset as input. Given the current API, what's the best way to interleave multiple datasets together? Say they have already been constructed and I have a list of them. I want to produce elements from them alternately, and I want to support lists with more than 2 datasets (i.e., stacked zips and interleaves would be pretty ugly).
Thanks! :)
@mrry might be able to help.

EDIT 2: See tf.contrib.data.choose_from_datasets. It performs deterministic dataset interleaving.
EDIT: See tf.contrib.data.sample_from_datasets. Even though it performs random sampling, I guess it can be useful.
Even though this is not "clean", it is the only workaround I came up with.
datasets = [tf.data.Dataset...]

def concat_datasets(datasets):
    ds0 = tf.data.Dataset.from_tensors(datasets[0])
    for ds1 in datasets[1:]:
        ds0 = ds0.concatenate(tf.data.Dataset.from_tensors(ds1))
    return ds0

ds = tf.data.Dataset.zip(tuple(datasets)).flat_map(
    lambda *args: concat_datasets(args)
)
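To see what this yields in practice, here is a small self-contained sketch (my own toy values, not from the question): with two datasets of three elements each, the zip/flat_map combination produces a round-robin interleave.
import tensorflow as tf

def concat_datasets(elements):
    # Turn each zipped element back into a one-element dataset and chain them.
    ds = tf.data.Dataset.from_tensors(elements[0])
    for e in elements[1:]:
        ds = ds.concatenate(tf.data.Dataset.from_tensors(e))
    return ds

ds_a = tf.data.Dataset.from_tensor_slices([1, 2, 3])
ds_b = tf.data.Dataset.from_tensor_slices([4, 5, 6])

ds = tf.data.Dataset.zip((ds_a, ds_b)).flat_map(
    lambda *args: concat_datasets(args))
# Iterating over ds yields 1, 4, 2, 5, 3, 6 -- a round-robin interleave.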

Expanding on user2781994's answer (with edits), here is how I implemented it:
import tensorflow as tf

ds11 = tf.data.Dataset.from_tensor_slices([1, 2, 3])
ds12 = tf.data.Dataset.from_tensor_slices([4, 5, 6])
ds13 = tf.data.Dataset.from_tensor_slices([7, 8, 9])
all_choices_ds = [ds11, ds12, ds13]

choice_dataset = tf.data.Dataset.range(len(all_choices_ds)).repeat()
ds14 = tf.contrib.data.choose_from_datasets(all_choices_ds, choice_dataset)
# alternatively:
# ds14 = tf.contrib.data.sample_from_datasets(all_choices_ds)

iterator = ds14.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    while True:
        try:
            value = sess.run(next_element)
        except tf.errors.OutOfRangeError:
            break
        print(value)
The output is:
1
4
7
2
5
8
3
6
9

In TensorFlow 2.0:
tot_imm_dataset1 = 105
tot_imm_dataset2 = 55
e = tf.data.Dataset.from_tensor_slices(tf.cast([1, 0, 1], tf.int64)).repeat(int(tot_imm_dataset1 / 2))
f = tf.data.Dataset.range(1).repeat(int(tot_imm_dataset2 - tot_imm_dataset1 / 2))
choice = e.concatenate(f)
datasets = [dataset2, dataset1]
dataset_rgb_compl__con_patch = tf.data.experimental.choose_from_datasets(datasets, choice)
That works for me.
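For reference, a minimal eager-mode sketch of the same deterministic interleave in TF 2.x (the toy datasets below are my own illustration, not taken from the answers above):
import tensorflow as tf

ds1 = tf.data.Dataset.from_tensor_slices([1, 2, 3])
ds2 = tf.data.Dataset.from_tensor_slices([4, 5, 6])
datasets = [ds1, ds2]

# Choice indices 0, 1, 0, 1, ... give a round-robin interleave.
choice = tf.data.Dataset.range(len(datasets)).repeat()
interleaved = tf.data.experimental.choose_from_datasets(datasets, choice)

print([int(x) for x in interleaved])  # [1, 4, 2, 5, 3, 6]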


Tensorflow indexing into python list during tf.while_loop

I have this annoying problem and I don't know how to solve it.
I am reading in batches of data from a CSV using a dataset reader and want to gather certain columns. The reader returns a tuple of tensors and, depending on which reader I use, columns are indexed either by integer or by string.
I can easily do a for loop in Python and slice the columns I want, but I want to do this in a tf.while_loop to take advantage of parallel execution.
This is where my issue lies: the iterator in the while loop is tensor based and I cannot use it to index into my dataset. If I try to evaluate it, I get an error about the session not being the same, etc.
How can I use a while loop (or a map function) and have the function index into a Python list/dict without evaluating or running the iterator tensor?
Simple example:
some_data = [1,2,3,4,5]
x = tf.constant(0)
y = len(some_data)
c = lambda x: tf.less(x, y)
b = lambda x: some_data[x] <--- You cannot index like this!
tf.while_loop(c, b, [x])
Does this fit your requirement somewhat? It does nothing apart from printing the value.
import tensorflow as tf
from tensorflow.python.framework import tensor_shape

some_data = [11, 222, 33, 4, 5, 6, 7, 8]

def func(v):
    print(some_data[v])
    return some_data[v]

with tf.Session() as sess:
    r = tf.while_loop(
        lambda i, v: i < 4,
        lambda i, v: [i + 1, tf.py_func(func, [i], [tf.int32])[0]],
        [tf.constant(0), tf.constant(2, tf.int32)],
        [tensor_shape.unknown_shape(), tensor_shape.unknown_shape()])
    r[1].eval()
It prints
11
4
222
33
The order changes every time, but I guess tf.control_dependencies may be useful to control that.
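If the goal is just to index a known numeric Python list inside the loop without tf.py_func, an alternative sketch (my own, in the same TF 1.x style as the answer above) is to put the list into a tensor and index it with tf.gather; passing parallel_iterations=1 also keeps the iterations in order.
import tensorflow as tf

some_data = [11, 222, 33, 4, 5, 6, 7, 8]
data_t = tf.constant(some_data, dtype=tf.int32)   # the whole list as one tensor

r = tf.while_loop(
    lambda i, v: i < 4,
    lambda i, v: [i + 1, tf.gather(data_t, i)],   # tensor-based indexing
    [tf.constant(0), tf.constant(0, tf.int32)],
    parallel_iterations=1)                        # run iterations sequentially

with tf.Session() as sess:
    print(sess.run(r))   # [4, 4]: final loop counter and the last gathered value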

What do batch, repeat, and shuffle do with a TensorFlow Dataset?

I'm currently learning TensorFlow, but I'm confused by the code snippet below:
dataset = dataset.shuffle(buffer_size = 10 * batch_size)
dataset = dataset.repeat(num_epochs).batch(batch_size)
return dataset.make_one_shot_iterator().get_next()
I know that the dataset will first hold all the data, but what do shuffle(), repeat(), and batch() do to the dataset?
Please help me with an example and explanation.
Update: Here is a small Colab notebook demonstrating this answer.
Imagine you have a dataset: [1, 2, 3, 4, 5, 6]. Then:
How ds.shuffle() works
dataset.shuffle(buffer_size=3) will allocate a buffer of size 3 for picking random entries. This buffer will be connected to the source dataset.
We could imagine it like this:
Random buffer
|
| Source dataset where all other elements live
| |
↓ ↓
[1,2,3] <= [4,5,6]
Let's assume that entry 2 was taken from the random buffer. The free slot is filled by the next element from the source dataset, which is 4:
2 <= [1,3,4] <= [5,6]
We continue reading till nothing is left:
1 <= [3,4,5] <= [6]
5 <= [3,4,6] <= []
3 <= [4,6] <= []
6 <= [4] <= []
4 <= [] <= []
How ds.repeat() works
As soon as all the entries have been read from the dataset and you try to read the next element, the dataset will raise an end-of-sequence error (tf.errors.OutOfRangeError in TF 1.x).
That's where ds.repeat() comes into play. It will re-initialize the dataset, making it again like this:
[1,2,3] <= [4,5,6]
What will ds.batch() produce
The ds.batch() will take the first batch_size entries and make a batch out of them. So, a batch size of 3 for our example dataset will produce two batch records:
[2,1,5]
[3,6,4]
As we have a ds.repeat() before the batch, the generation of data will continue. But the order of the elements will be different, due to ds.shuffle(). What should be taken into account is that 6 will never be present in the first batch, due to the size of the random buffer.
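To see the three transformations together, here is a small runnable sketch (TF 2.x eager style, my own toy pipeline; the contents of each batch will vary from run to run because of the shuffle):
import tensorflow as tf

dataset = tf.data.Dataset.range(1, 7)        # elements 1..6, as in the example above
dataset = dataset.shuffle(buffer_size=3)     # small shuffle buffer, as in the walkthrough
dataset = dataset.repeat(2)                  # two passes over the data
dataset = dataset.batch(3)                   # batches of 3 elements

for batch in dataset:
    print(batch.numpy())
# Possible output (order varies per run):
# [2 1 4]
# [3 6 5]
# [1 3 2]
# [5 6 4]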
The following methods in tf.data.Dataset:
repeat(count=None): repeats the dataset count times; with the default count=None (or -1) it repeats indefinitely.
shuffle(buffer_size, seed=None, reshuffle_each_iteration=None): shuffles the elements of the dataset. buffer_size is the number of elements held in the shuffle buffer, from which each output element is drawn at random.
batch(batch_size, drop_remainder=False): combines consecutive elements into batches of length batch_size; with drop_remainder=True a final smaller batch is dropped (a small sketch follows).
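A quick illustration of drop_remainder (TF 2.x eager style, my own toy values): 10 elements batched by 4 give three batches by default, but only two full batches when the remainder is dropped.
import tensorflow as tf

ds = tf.data.Dataset.range(10)

print([b.numpy().tolist() for b in ds.batch(4)])
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

print([b.numpy().tolist() for b in ds.batch(4, drop_remainder=True)])
# [[0, 1, 2, 3], [4, 5, 6, 7]]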
An example that shows looping over epochs. Upon running this script notice the difference in
dataset_gen1 - shuffle operation produces more random outputs (this may be more useful while running machine learning experiments)
dataset_gen2 - lack of shuffle operation produces elements in sequence
Other additions in this script
tf.data.experimental.sample_from_datasets - used to combine two datasets. Note that in this case the shuffle buffer receives elements that are sampled (by default, with equal probability) from both datasets.
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # to avoid all those prints
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"  # to avoid large "Kernel Launch Time"

import tensorflow as tf
if len(tf.config.list_physical_devices('GPU')):
    tf.config.experimental.set_memory_growth(tf.config.list_physical_devices('GPU')[0], True)

class Augmentations:

    def __init__(self):
        pass

    @tf.function
    def filter_even(self, x):
        if x % 2 == 0:
            return False
        else:
            return True

class Dataset:

    def __init__(self, aug, range_min=0, range_max=100):
        self.range_min = range_min
        self.range_max = range_max
        self.aug = aug

    def generator(self):
        dataset = tf.data.Dataset.from_generator(self._generator,
                                                 output_types=(tf.float32), args=())
        dataset = dataset.filter(self.aug.filter_even)
        return dataset

    def _generator(self):
        for item in range(self.range_min, self.range_max):
            yield item

# Can be used when you have multiple datasets that you wish to combine
class ZipDataset:

    def __init__(self, datasets):
        self.datasets = datasets
        self.datasets_generators = []

    def generator(self):
        for dataset in self.datasets:
            self.datasets_generators.append(dataset.generator())
        return tf.data.experimental.sample_from_datasets(self.datasets_generators)

if __name__ == "__main__":
    aug = Augmentations()
    dataset1 = Dataset(aug, 0, 100)
    dataset2 = Dataset(aug, 100, 200)
    dataset = ZipDataset([dataset1, dataset2])

    epochs = 2
    shuffle_buffer = 10
    batch_size = 4
    prefetch_buffer = 5

    dataset_gen1 = dataset.generator().shuffle(shuffle_buffer).batch(batch_size).prefetch(prefetch_buffer)
    # dataset_gen2 = dataset.generator().batch(batch_size).prefetch(prefetch_buffer)  # this will output odd elements in sequence

    for epoch in range(epochs):
        print('\n ------------------ Epoch: {} ------------------'.format(epoch))
        for X in dataset_gen1.repeat(1):  # adding .repeat() in the loop allows you to easily control the end of the loop
            print(X)
        # Do some stuff at end of loop

How to flatten a tensorflow dataset along feature columns when using data.make_csv_dataset?

I am using tf.contrib.data.make_csv_dataset to read CSV files having differing numbers of feature columns.
After reading each file I want to concatenate all the feature columns.
dataset = tf.contrib.data.make_csv_dataset(
    file_names[0], 48,
    select_columns=['Load_residential_multi_0', 'Load_residential_multi_1'],
    shuffle=False)
dataset = dataset.batch(2)
get_batch = dataset.make_one_shot_iterator()
get_batch = get_batch.get_next()

with tf.Session() as sess:
    power_data = sess.run(get_batch)
    print(power_data.keys())
The above code will give an ordered dictionary with two keys, as shown below:
odict_keys(['Load_residential_multi_0', 'Load_residential_multi_1'])
I can access individual features using the feature names. For example power_data['Load_residential_multi_0'] will give me,
array([[0.075 , 0.1225, 0.0775, 0.12 ],
[0.0875, 0.1125, 0.095 , 0.1025]], dtype=float32)
However, I want both feature columns 'Load_residential_multi_0' and 'Load_residential_multi_1' to be concatenated.
I think I can do this using dataset.flat_map(map_func), but I am not sure what I should use as the argument to flat_map().
By using dataset.map you can concatenate both dictionary values:
dataset = dataset.map(lambda x: tf.stack(list(x.values())))
get_batch = dataset.make_one_shot_iterator()
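If the aim is to join the two columns along the feature axis rather than stack them on a new axis, a small variation of the same map should work (a sketch continuing from the snippet above; I have not run it against this exact CSV layout):
# tf.stack creates a new axis; tf.concat joins along an existing (here the last) axis.
dataset = dataset.map(lambda x: tf.concat(list(x.values()), axis=-1))
get_batch = dataset.make_one_shot_iterator()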

How to use dataset.shard in tensorflow?

Recently I have been looking into the Dataset API in TensorFlow, and there is a method dataset.shard() which is for distributed computation.
This is what's stated in Tensorflow's documentation:
Creates a Dataset that includes only 1/num_shards of this dataset.
d = tf.data.TFRecordDataset(FLAGS.input_file)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
This method is said to return a portion of the original dataset. If I have two workers, am I supposed to do:
d_0 = d.shard(FLAGS.num_workers, worker_0)
d_1 = d.shard(FLAGS.num_workers, worker_1)
......
iterator_0 = d_0.make_initializable_iterator()
iterator_1 = d_1.make_initializable_iterator()

for worker_id in workers:
    with tf.device(worker_id):
        if worker_id == 0:
            data = iterator_0.get_next()
        else:
            data = iterator_1.get_next()
......
Because the documentation did not specify how to make subsequent calls, I am a bit confused here.
Thanks!
You should take a look at the tutorial on Distributed TensorFlow first to better understand how it works.
You have multiple workers that each run the same code, but with a small difference: each worker will have a different FLAGS.worker_index.
When you use tf.data.Dataset.shard, you will supply this worker index and the data will be split between workers equally.
Here is an example with 3 workers.
dataset = tf.data.Dataset.range(6)
dataset = dataset.shard(FLAGS.num_workers, FLAGS.worker_index)
iterator = dataset.make_one_shot_iterator()
res = iterator.get_next()

# Suppose you have 3 workers in total
with tf.Session() as sess:
    for i in range(2):
        print(sess.run(res))
We will have the output:
0, 3 on worker 0
1, 4 on worker 1
2, 5 on worker 2
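To convince yourself of the split, here is a small standalone sketch (TF 2.x eager style; it merely simulates the three worker indices in a single process):
import tensorflow as tf

num_workers = 3
for worker_index in range(num_workers):
    shard = tf.data.Dataset.range(6).shard(num_workers, worker_index)
    print(worker_index, [int(x) for x in shard])
# 0 [0, 3]
# 1 [1, 4]
# 2 [2, 5]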

how to get shuffled batch from tfrecords with limited memory but large data set?

Using the TensorFlow function tf.train.shuffle_batch we get a shuffled batch by reading the TFRecord into memory as a queue and shuffling within that queue (um, if I understand it correctly). Now I have a highly ordered TFRecord (pictures of the same label are written together) and a really large dataset (around 2,550,000 pictures). I want to feed my VGG net with batches of random labels, but it is impossible and ugly to read all the pictures into memory and shuffle them. Is there any solution to this?
I thought about maybe shuffling first and then writing them into the TFRecord, but I can't figure out an effective way of doing this...
My data are saved class by class in nested directories (a screenshot of the layout was attached to the original post). Here is my code for writing the TFRecords:
dst = "/Users/cory/Desktop/3_key_frame"
classes=[]
for myclass in os.listdir(dst):
if myclass.find('.DS_Store')==-1:
classes.append(myclass)
writer = tf.python_io.TFRecordWriter("train.tfrecords")
for index, name in enumerate(classes):
class_path = dst +'/' + name
#print(class_path)
for img_seq in os.listdir(class_path):
if img_seq.find('DS_Store')==-1:
seq_pos = class_path +'/' + img_seq
if os.path.isdir(seq_pos):
for img_name in os.listdir(seq_pos):
img_path = seq_pos +'/' + img_name
img = Image.open(img_path)
img = img.resize((64,64))
img_raw = img.tobytes()
#print (img,index)
example = tf.train.Example(features=tf.train.Features(feature={
"label":tf.train.Feature(int64_list=tf.train.Int64List(value=[index])),
'img_raw':tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw]))
}))
writer.write(example.SerializeToString())
writer.close()
I am presuming you have the known list of filenames and/or structure of your labelled dataset.
It may be worthwhile iterating through them on a class-by-class basis, taking N items each time; in essence interleaving the classes so that you don't have sequential issues.
If I understand correctly, your primary concern is that, when sampling your dataset from the TFRecord, a subset of your data may contain entirely one class rather than a good representation?
If you structure it as:
0 0 0 0 1 1 1 1 2 2 2 2 0 0 0 0 1 1 1 1 2 2 2 2 ... etc
this may make the shuffle_batch more likely to create a nicer sample for training.
This is the solution I am following, as there appear to be no additional shuffling parameters that let you keep a uniform distribution of class labels in the sampled set; a small sketch of that write-time interleaving follows.
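For example, a minimal pure-Python sketch of that write-time interleaving (the names filenames_per_class and N are hypothetical, not from the question): chunk each class's file list into groups of N and take one group from each class in turn before writing the TFRecord.
from itertools import zip_longest

N = 4  # how many files to take from each class per pass

def round_robin_order(filenames_per_class, n=N):
    ordered = []
    # Split each class's file list into chunks of n.
    chunked = {
        label: [files[i:i + n] for i in range(0, len(files), n)]
        for label, files in filenames_per_class.items()
    }
    # Take one chunk from each class in turn until everything is consumed.
    for chunks in zip_longest(*chunked.values(), fillvalue=[]):
        for chunk in chunks:
            ordered.extend(chunk)
    return ordered

# ordered_files = round_robin_order(filenames_per_class)
# ...then write the examples to the TFRecord in this order.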
Supposing that your data is stored like this:
/path/to/images/LABEL_1/image001.jpg
/path/to/images/LABEL_1/image002.jpg
...
/path/to/images/LABEL_10/image001.jpg
Get all the filenames in a flat list and shuffle them:
import glob
import random

filenames = glob.glob('/path/to/images/**/*.jpg')
random.shuffle(filenames)
Create a dictionary to go from label name to numerical label:
class_to_index = {'LABEL_1':0, 'LABEL_2': 1} # more classes I assume...
Now you can loop over all images and retrieve the label:
writer = tf.python_io.TFRecordWriter("train.tfrecords")
for f in filenames:
    img = Image.open(f)
    img = img.resize((64, 64))
    img_raw = img.tobytes()
    label = f.split('/')[-2]
    example = tf.train.Example(features=tf.train.Features(feature={
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[class_to_index[label]])),
        'img_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw]))
    }))
    writer.write(example.SerializeToString())
writer.close()
Hope this helps :)
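One possible follow-up (my addition, not part of the answer above): even with the filenames shuffled at write time, you can still add inexpensive read-time randomness with tf.train.shuffle_batch and a buffer small enough to fit in memory. Here image and label stand for the tensors obtained after parsing a single example from the TFRecord (parsing code omitted).
batch_size = 32
min_after_dequeue = 10000          # how many examples the shuffle buffer keeps
capacity = min_after_dequeue + 3 * batch_size

image_batch, label_batch = tf.train.shuffle_batch(
    [image, label],
    batch_size=batch_size,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue)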