I'm switching my old data layer (using Queues) to the "new" and recommended Dataset API. I'm using it for the first time, so I'm providing code examples in case I got something fundamentally wrong.
I create my Dataset from a generator (that will read a file, and provide n samples). It's a small dataset and n_iterations >> n_samples, so I simply want to read this dataset over and over again, ideally shuffled.
sample_set = tf.data.Dataset.from_generator(
    data_generator(filename),
    (tf.uint8, tf.uint8),
    (tf.TensorShape([256, 256, 4]), tf.TensorShape([256, 256, 1])))
where data_generator is:

class data_generator:
    def __init__(self, filename):
        self.filename = filename

    def __call__(self):
        with open(self.filename) as f:
            for idx in f:
                yield img[idx], label[idx]  # placeholders for the actual sample/label lookup
To actually use the data, I gathered that I need to define an Iterator:
sample = sample_set.make_one_shot_iterator().get_next()
and then we are set to read data:

while True:
    try:
        my_sample = sess.run(sample)
    except tf.errors.OutOfRangeError:
        break  # this happens after the dataset has been read once
But all available Iterators seem to be "finite", in that they read a dataset only once.
Is there a simple way to make reading from the Dataset endless?
Datasets have repeat and shuffle methods.
BUF_SIZE = 100  # choose it depending on your data

sample_set = tf.data.Dataset.from_generator(
    data_generator(filename),
    (tf.uint8, tf.uint8),
    (tf.TensorShape([256, 256, 4]), tf.TensorShape([256, 256, 1]))
).repeat().shuffle(BUF_SIZE)
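Note that the order of these two calls matters: with repeat() before shuffle(), as above, the shuffle buffer can mix elements from adjacent epochs, whereas shuffle() before repeat() keeps epoch boundaries intact. A sketch of the same pipeline with the other ordering, if you want clean epochs:

BUF_SIZE = 100  # choose it depending on your data

sample_set = tf.data.Dataset.from_generator(
    data_generator(filename),
    (tf.uint8, tf.uint8),
    (tf.TensorShape([256, 256, 4]), tf.TensorShape([256, 256, 1]))
).shuffle(BUF_SIZE).repeat()  # shuffle within each epoch, then repeat endlessly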
The Dataset.repeat() transformation will repeat a dataset endlessly if you don't pass an explicit count to it:
sample_set = tf.data.Dataset.from_generator(
    data_generator(filename), (tf.uint8, tf.uint8),
    (tf.TensorShape([256, 256, 4]), tf.TensorShape([256, 256, 1])))
# Repeats `sample_set` endlessly.
sample_set = sample_set.repeat()
sample = sample_set.make_one_shot_iterator().get_next()
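With an endless repeat(), the one-shot iterator never exhausts, so the OutOfRangeError handling from the question becomes unnecessary; a plain bounded loop is enough. A minimal sketch:

with tf.Session() as sess:
    for _ in range(n_iterations):
        my_sample = sess.run(sample)  # never raises tf.errors.OutOfRangeError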
A reinitializable Iterator can be re-initialized on the same dataset, so this code will read the same dataset over and over again:

sample_it = tf.data.Iterator.from_structure(sample_set.output_types,
                                            sample_set.output_shapes)
sample = sample_it.get_next()
sample_set_init_op = sample_it.make_initializer(sample_set)  # create the initializer op

with tf.Session(config=config) as sess:
    sess.run(sample_set_init_op)  # initialize at the beginning
    while True:
        try:
            my_sample = sess.run(sample)
        except tf.errors.OutOfRangeError:
            sess.run(sample_set_init_op)  # re-initialize on the same dataset
I am following the Google Machine Learning Intensive Course, but it uses version 1.x of TensorFlow, so I was planning to change the exercises to run them in TensorFlow 2.0. I am stuck on this exercise:
https://colab.research.google.com/notebooks/mlcc/first_steps_with_tensor_flow.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=firststeps-colab&hl=es#scrollTo=7UwqGbbxP53O
Specifically the code:
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Trains a linear regression model of one feature.

    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely

    Returns:
      Tuple of (features, labels) for next data batch
    """

    # Convert pandas data into a dict of np arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}

    # Construct a dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features, targets))  # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
I have replaced features, labels = ds.make_one_shot_iterator().get_next() with features, labels = tf.compat.v1.data.make_one_shot_iterator(ds).get_next(), and it seems to work, but make_one_shot_iterator() is deprecated, so how can I replace it?
Also, according to https://github.com/tensorflow/tensorflow/issues/29252, I have tried
features, labels = ds.__iter__()
next(ds.__iter__())
return features, labels
but it returns the error __iter__() is only supported inside of tf.function or when eager execution is enabled.
I am quite inexperienced in Python and follow the course as a hobbyist. Any ideas on how to solve it? Thank you.
After several tests, the Python hang turned out to be a local problem.
To replace features, labels = ds.make_one_shot_iterator().get_next() I have tried several things:
features, labels = ds.__iter__().get_next()
iterator = ds.__iter__()
features, labels = iterator.get_next()
it = iter(ds)
features, labels = next(it)
All three attempts return the same __iter__() is only supported inside of tf.function or when eager execution is enabled error, so I tried:
features, labels = ds
return ds
And also just:
return features, labels
And both return the same error. Finally I tried just:
return ds
And mysteriously it worked, I have no idea why, but it did.
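For what it's worth, a likely explanation: the line features, labels = ds has to iterate the dataset, which is only allowed in eager mode or inside a tf.function, whereas an Estimator input_fn may also return a tf.data.Dataset of (features, labels) directly, in which case the Estimator creates and initializes its own iterator internally. A minimal sketch of the working version under that reading:

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    # Convert pandas data into a dict of np arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}
    # Construct a dataset, and configure batching/repeating.
    ds = tf.data.Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(batch_size).repeat(num_epochs)
    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)
    # Return the Dataset itself; the Estimator builds its own iterator.
    return ds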
1) I doubt that you've really got what you wanted. If your input actually needs to be multi-input, then returning ds as-is is unlikely to suit you; you need to unpack it into the individual tensors, something like this:
features = tf.compat.v1.data.make_one_shot_iterator(train_dataset).get_next()
image, label = features['image'], features['label']
2) Concerning the Iterator: it now belongs to tf.data, with the tf.data.Iterator.get_next() method, as opposed to the previous ds.make_one_shot_iterator() on tf.data.Dataset. Perhaps a 'Dependency Inversion' (the D in the SOLID principles) refactoring was done.
The new Iterator entity can now be used to feed tf.data.Dataset.from_generator() objects from a generator function in async mode, consuming each yielded chunk of data; see, for example, overriding a custom tfds.core.GeneratorBasedBuilder.
I think the overall architecture of the TF library was refactored a bit, so the input pipeline now consumes data batch-by-batch on its own, and make_one_shot_iterator applied to a Dataset is no longer needed. Even for debugging there is .as_numpy_iterator(), and make_one_shot_iterator is no longer considered necessary by the developers,
though sometimes people use:
iterator = iter(batched_dataset)
next_element = iterator.get_next()
though I cannot yet say where this would be needed.
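For the debugging route mentioned above, a minimal sketch of .as_numpy_iterator() (available in TF 2.1+, eager mode):

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])
for elem in ds.as_numpy_iterator():  # yields plain NumPy arrays instead of Tensors
    print(elem)                      # [1 2], then [3 4]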
P.S. By the way, as far as I remember from the debugger, if your container is hashable or not iterable (correct me if I'm wrong), you can try:

iterator = iter(dataset)
# batch_features, batch_labels = iterator.get_next()
el = iterator.get_next()
batch_features = el[0]  # first element of the (features, labels) tuple
print(batch_features)
batch_labels = el[1]    # second element
print(batch_labels)

This works OK.
I'm having problems with the function resnet50.preprocess_input() from tensorflow.compat.v1.keras.applications.resnet50.
In particular, after some trial and error, I can say the problem appears when, inside a dataset generator function, there is a call:
dataset.map(pre_processing_image)
where
def pre_processing_image(image):
    image = resnet50.preprocess_input(image)
    return image
and the dataset is split into batches. When I reach the last batch, no matter whether it is complete or smaller, I get an error similar to:
Tensor("Const:0", shape=(3,), dtype=float32) must be from the same graph as Tensor("BatchDatasetV2:0", shape=(), dtype=variant)
I can't really understand what is going on, because:
If I use another preprocess_input, such as the one for mobilenet, without changing anything else, then there is no problem. Digging into the code, I found that those functions all call this one, but mobilenet uses mode='tf' while resnet uses 'caffe'.
The error isn't related to the last batch being smaller than the others; I tried making them all equal, but the error keeps happening at the last step of the first epoch of training.
If I don't use map but instead call pre_processing_image directly inside tf.data.Dataset.from_generator, there is no problem, but the code becomes a lot slower.
To give you the full code:
def image_gen(ds_path, ds_scores=None):
    for i, path in enumerate(ds_path):
        img = im.load_img(path,
                          color_mode='rgb',
                          target_size=(NETWORK_INFO.value[1], NETWORK_INFO.value[1]),
                          interpolation='bilinear')
        img_to_numpy = np.array(img)
        if ds_scores is not None:
            yield img_to_numpy, ds_scores[i]
        else:
            yield img_to_numpy

def pre_processing_image(image, score=None):
    image = resnet50.preprocess_input(image)
    if score is None:
        return image
    else:
        return image, score

def generator(batchsize, train=False, val=False, test=False, shuffle=False):
    with tf.Session() as sess:
        if train:
            dataset = tf.data.Dataset.from_generator(lambda: image_gen(train_paths, train_scores),
                                                     output_types=(tf.float32, tf.float32))
        elif val:
            dataset = tf.data.Dataset.from_generator(lambda: image_gen(val_paths, val_scores),
                                                     output_types=(tf.float32, tf.float32))
        else:
            dataset = tf.data.Dataset.from_generator(lambda: image_gen(test_paths),
                                                     output_types=(tf.float32))

        if shuffle:
            dataset = dataset.shuffle(buffer_size=10 * batchsize)
        dataset = dataset.batch(batchsize)
        dataset = dataset.map(pre_processing_image,
                              num_parallel_calls=tf.data.experimental.AUTOTUNE)
        dataset = dataset.prefetch(buffer_size=2)
        dataset = dataset.repeat(count=-1)

        iterable = tf.data.make_initializable_iterator(dataset)
        batch = iterable.get_next()
        sess.run(iterable.initializer)

        # yield for as long as data is requested
        while True:
            try:
                yield sess.run(batch)
            except tf.errors.OutOfRangeError:
                pass
I tried to mess with the position of the map function and with the shuffle/prefetch parameters, but nothing solved the issue. Finally, as you can see, I use the same function for both the training and validation generators; I just change the input parameters to select which dataset the function should use.
Solved the issue.
I searched for something similar affecting other networks that share the same image preprocessing (such as VGG16), and it turns out the related issues were Keras bugs.
I updated the keras-applications module to the latest commit (commit, not release!) and the code now works without problems.
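As a sanity check that the upgrade took effect, a small hypothetical snippet (assuming the package was installed as keras-applications, e.g. via pip install -U git+https://github.com/keras-team/keras-applications.git):

# Hypothetical check: print the installed keras-applications version.
import pkg_resources
print(pkg_resources.get_distribution("keras-applications").version)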
It is recommended to use a tensorflow Dataset as the input pipeline, which can be set up as follows:
# Specify dataset
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Shuffle (buffer_size must be an integer)
dataset = dataset.shuffle(buffer_size=int(1e5))
# Specify batch size
dataset = dataset.batch(128)
# Create an iterator
iterator = dataset.make_one_shot_iterator()
# Get next batch
next_batch = iterator.get_next()
I should be able to get the batch size (either from the dataset itself or from an iterator created from it, i.e. from both iterator and next_batch). Maybe someone wants to know how many batches there are in the dataset or its iterators, or how many batches have been consumed and how many remain in the iterator. One might also want to get particular elements, or even the entire dataset at once.
I wasn't able to find anything in the tensorflow documentation. Is this possible? If not, does anyone know if this has been requested as an issue on the tensorflow GitHub?
Try this
import tensorflow as tf
import numpy as np

features = np.array([[3.0, 0.0], [1.0, 2.0], [0.0, 0.0]], dtype="float32")
labels = np.array([[0], [0], [1]], dtype="float32")

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
batch_size = 2
dataset = dataset.batch(batch_size)

iterator = dataset.make_initializable_iterator()
batch_data = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    print(np.shape(sess.run(batch_data)[0])[0])
and you will see 2, the size of the first batch.
In TF2 at least, the type of a dataset is statically defined and accessible via tf.data.Dataset.element_spec.
This is a somewhat complex return type because it has tuple nesting that matches your Dataset.
>>> tf.data.Dataset.from_tensor_slices([[[1]],[[2]]]).element_spec.shape
TensorShape([1, 1])
If your data is organized as an (image, label) tuple, then you'd get a tuple of TensorSpecs. You can index into it if you are certain of the nesting of the return type, e.g.
>>> image = tf.data.Dataset.from_tensor_slices([[1],[2],[3],[4]]).batch(2, drop_remainder=True)
>>> label = tf.data.Dataset.from_tensor_slices([[1],[2],[3],[4]]).batch(2, drop_remainder=True)
>>> train = tf.data.Dataset.zip((image, label))
>>> train.element_spec[0].shape[0]
2
In TF2, tf.data.Datasets are iterables, so you can get a batch by simply doing:
batch = next(iter(dataset))
and then calculating the batch size is trivial, since it is the size of the first dimension:
batch_size = batch.shape[0]  # for a dataset of (features, labels) tuples, use batch[0].shape[0]
So a complete example would look like:
# Specify dataset
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Shuffle (buffer_size must be an integer)
dataset = dataset.shuffle(buffer_size=int(1e5))
# Specify batch size
dataset = dataset.batch(128)
# Calculate and print batch size
batch_size = next(iter(dataset))[0].shape[0]  # index [0]: elements here are (features, labels) tuples
print('Batch size:', batch_size) # prints 128
Or, if you need it as a function:
def calculate_batch_size(dataset):
    return next(iter(dataset)).shape[0]
Note that iterating over a dataset requires eager execution. Moreover, this solution assumes that your dataset is batched, and may get errors if this is not the case. You may also face errors if, after batching, you perform other operations on your dataset that change the shape of its elements.
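On the related question of how many batches a dataset contains: newer TF 2.x releases can often answer this statically via tf.data.experimental.cardinality() (also available as Dataset.cardinality() in more recent versions), which returns the number of elements when it is known and a sentinel such as tf.data.experimental.UNKNOWN_CARDINALITY or INFINITE_CARDINALITY otherwise. A short sketch:

import tensorflow as tf

dataset = tf.data.Dataset.range(1000).batch(128)
print(tf.data.experimental.cardinality(dataset).numpy())  # 8: seven full batches plus one partial batch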
I have a GCMLE experiment and I am trying to upgrade my input_fn to use the new tf.data functionality. I have created the following input_fn based on this sample:
def input_fn(...):
    dataset = tf.data.Dataset.list_files(filenames).shuffle(num_shards)  # shuffle up the list of input files
    # Mix together records from cycle_length number of shards.
    dataset = dataset.interleave(
        lambda filename: tf.data.TextLineDataset(filename).skip(1).map(
            lambda row: parse_csv(row, hparams)),
        cycle_length=5)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    features = iterator.get_next()
    labels = features.pop(LABEL_COLUMN)
    return features, labels
My parse_csv is the same as what I used previously, but it is not currently working. I can fix some of the issues, but I don't fully understand why I am having them. Here is the start of my parse_csv() function:
def parse_csv(..):
    columns = tf.decode_csv(rows, record_defaults=CSV_COLUMN_DEFAULTS)
    raw_features = dict(zip(FIELDNAMES, columns))
    words = tf.string_split(raw_features['sentences'])  # splitting words
    vocab_table = tf.contrib.lookup.index_table_from_file(vocabulary_file=hparams.vocab_file,
                                                          default_value=0)
    ....
Right away this tf.string_split() stops working, and the error is ValueError: Shape must be rank 1 but is rank 0 for 'csv_preprocessing/input_sequence_generation/StringSplit' (op: 'StringSplit') with input shapes: [], []. This is easily solved by packing raw_features['sentences'] into a tensor via [raw_features['sentences']], but I do not understand why this is needed with the dataset approach. How come this worked fine in the old version? For the shapes to match up with the rest of my model, I end up needing to remove this extra dimension at the end via words = tf.squeeze(words, 0), because the packing adds an "unnecessary" dimension to the tensor.
For whatever reason, I am also getting an error that the table is not initialized: tensorflow.python.framework.errors_impl.FailedPreconditionError: Table not initialized. However, this code works completely fine with my old input_fn() (see below), so I don't know why I would now need to initialize the tables. I have not figured out a solution to this part. Is there anything I am missing to be able to use tf.contrib.lookup.index_table_from_file within my parse_csv function?
For reference, this is my old input_fn() that still does work:
def input_fn(...):
    filename_queue = tf.train.string_input_producer(tf.train.match_filenames_once(filenames),
                                                    num_epochs=num_epochs, shuffle=shuffle, capacity=32)
    reader = tf.TextLineReader(skip_header_lines=skip_header_lines)
    _, rows = reader.read_up_to(filename_queue, num_records=batch_size)
    features = parse_csv(rows, hparams)
    if shuffle:
        features = tf.train.shuffle_batch(
            features,
            batch_size,
            min_after_dequeue=2 * batch_size + 1,
            capacity=batch_size * 10,
            num_threads=multiprocessing.cpu_count(),
            enqueue_many=True,
            allow_smaller_final_batch=True
        )
    else:
        features = tf.train.batch(
            features,
            batch_size,
            capacity=batch_size * 10,
            num_threads=multiprocessing.cpu_count(),
            enqueue_many=True,
            allow_smaller_final_batch=True
        )
    labels = features.pop(LABEL_COLUMN)
    return features, labels
UPDATE TF 1.7
I am revisiting this with TF 1.7 (which should have all of the TF 1.6 features mentioned in @mrry's answer), but I'm still unable to replicate the behavior. With my old input_fn() I am able to get around 13 steps/sec. The new function that I am using is as follows:
def input_fn(...):
    files = tf.data.Dataset.list_files(filenames).shuffle(num_shards)
    dataset = files.apply(tf.contrib.data.parallel_interleave(
        lambda filename: tf.data.TextLineDataset(filename).skip(1),
        cycle_length=num_shards))
    dataset = dataset.apply(tf.contrib.data.map_and_batch(
        lambda row: parse_csv_dataset(row, hparams=hparams),
        batch_size=batch_size,
        num_parallel_batches=multiprocessing.cpu_count()))
    dataset = dataset.prefetch(1)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_initializable_iterator()
    features = iterator.get_next()
    tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)
    labels = {key: features.pop(key) for key in LABEL_COLUMNS}
    return features, labels
I believe that I am following all of the performance guidelines, such as 1) using prefetch, 2) using map_and_batch with num_parallel_batches = cores, 3) using parallel_interleave, and 4) applying shuffle before repeat. The only step I am not using is the cache suggestion, but I would expect that to really only help for epochs beyond the first one, as well as "applying interleave, prefetch and shuffle first." However, I found that having prefetch and shuffle after map_and_batch gave a ~10% speedup.
BUFFER ISSUE
The first performance issue I am noticing: with my old input_fn() it took about 13 wall-clock minutes to get through 20k steps, yet even with a buffer_size of 10,000 (which I take to mean we are waiting until we have 10,000 batches processed) I am still waiting more than 40 minutes for the buffer to fill. Does it make sense for this to take so long? If I know that my sharded .csv's on GCS are already randomized, is it acceptable to make this shuffle/buffer size smaller? I am trying to replicate the behavior of tf.train.shuffle_batch(); at worst, it seems it should take the same 13 minutes that it took to reach 10k steps in order to fill up the buffer?
STEPS/SEC
Even once the buffer has filled up, the global steps/sec tops out around 3 (often as low as 2) on the same model that achieves ~13 steps/sec with the previous input_fn().
SLOPPY INTERLEAVE
I finally tried to replace parallel_interleave() with sloppy_interleave(), as this was another suggestion from @mrry. When I switched to sloppy_interleave I got 14 steps/sec! I know this means it is not deterministic, but shouldn't that really just mean it is not deterministic from one run (or epoch) to the next? Or are there larger implications? Should I be concerned about any real difference between the old shuffle_batch() method and sloppy_interleave? Does the fact that this results in a 4-5x improvement suggest what the previous blocking factor was?
In TF 1.4 (which is currently the latest version of TF that works with GCMLE) you will not be able to use make_one_shot_iterator() with lookup tables (see the relevant post); you will need to use Dataset.make_initializable_iterator() and then add iterator.initializer to the default table initializers collection (from this post). Here is what the input_fn() should look like:
def input_fn(...):
    dataset = tf.data.Dataset.list_files(filenames).shuffle(num_shards)

    # Define `vocab_table` outside the map function and use it in `parse_csv()`.
    vocab_table = tf.contrib.lookup.index_table_from_file(
        vocabulary_file=hparams.vocab_file, default_value=0)

    dataset = dataset.interleave(
        lambda filename: (tf.data.TextLineDataset(filename)
                          .skip(1)
                          .map(lambda row: parse_csv(row, hparams),
                               num_parallel_calls=multiprocessing.cpu_count())),
        cycle_length=5)

    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)

    iterator = dataset.make_initializable_iterator()
    features = iterator.get_next()

    # Add iterator.initializer to be handled by the default table initializers.
    tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)

    labels = features.pop(LABEL_COLUMN)
    return features, labels
When you use tf.data.TextLineDataset, each element is a scalar string. In this respect, it is more similar to using tf.TextLineReader.read() than to the batched version tf.TextLineReader.read_up_to(), which returns a vector of strings. Unfortunately the tf.string_split() op demands a vector input (although this could potentially be changed in the future), so the shape manipulation is currently necessary.
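To make the shape issue concrete, here is a tiny sketch (the string is illustrative):

import tensorflow as tf

line = tf.constant("the quick brown fox")  # a TextLineDataset element: a scalar string, shape []
# tf.string_split() requires a rank-1 input, so the scalar is wrapped in a list:
words = tf.string_split([line])            # a SparseTensor with dense_shape [1, 4]
# words.values holds the flat vector of tokens; the leading batch-of-1
# dimension is what the question then removes again with tf.squeeze().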
Lookup tables interact a little differently with the functions in tf.data. The intuition is that you should declare the lookup table once outside the Dataset.map() call (so that it will be initialized once) and then capture it inside the parse_csv() function to call vocab_table.lookup(). Something like the following should work:
def input_fn(...):
    dataset = tf.data.Dataset.list_files(filenames).shuffle(num_shards)

    # Define `vocab_table` outside the map function and use it in `parse_csv()`.
    vocab_table = tf.contrib.lookup.index_table_from_file(
        vocabulary_file=hparams.vocab_file, default_value=0)

    def parse_csv(...):
        columns = tf.decode_csv(rows, record_defaults=CSV_COLUMN_DEFAULTS)
        raw_features = dict(zip(FIELDNAMES, columns))
        words = tf.string_split([raw_features['sentences']])  # splitting words
        # Use the captured `vocab_table` here.
        word_indices = vocab_table.lookup(words)
        # ...
        features = ...
        # NOTE: Structure the output here so that you can simply return
        # the dataset from `input_fn()`.
        labels = features.pop(LABEL_COLUMN)
        return features, labels

    # NOTE: Consider using `tf.contrib.data.parallel_interleave()` to perform
    # the reads in parallel.
    dataset = dataset.interleave(
        lambda filename: (tf.data.TextLineDataset(filename)
                          .skip(1)
                          .map(lambda row: parse_csv(row, hparams),
                               num_parallel_calls=multiprocessing.cpu_count())),
        cycle_length=5)

    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)

    # NOTE: Add prefetching here to run the input pipeline in the background.
    dataset = dataset.prefetch(1)

    # NOTE: This requires TensorFlow 1.5 or later, but it simplifies the
    # initialization of the lookup table.
    return dataset
I am learning TensorFlow (TF), and it's been just one day, so I apologize in advance if my question is too basic.
I was studying the linear classification example on the official TF website.
The authors defined a function called input_fn to read the data. The function is as follows:
def input_fn(data_file, num_epochs, shuffle, batch_size):
    """Generate an input function for the Estimator."""
    assert tf.gfile.Exists(data_file), (
        '%s not found. Please make sure you have either run data_download.py or '
        'set both arguments --train_data and --test_data.' % data_file)

    def parse_csv(value):
        print('Parsing', data_file)
        columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
        features = dict(zip(_CSV_COLUMNS, columns))
        labels = features.pop('income_bracket')
        return features, tf.equal(labels, '>50K')

    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)

    if shuffle:
        dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])

    dataset = dataset.map(parse_csv, num_parallel_calls=5)

    # We call repeat after shuffling, rather than before, to prevent separate
    # epochs from blending together.
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)

    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels
I am not able to understand the second-to-last line. The one-shot iterator calls get_next() only once, but shouldn't it iterate over the data multiple times (i.e. as many times as there are rows) to extract them, like this example here?
Here, get_next() is basically a dequeue op. The data sits in a queue; when you consume (run) the element returned by get_next(), it is removed from the queue, and the next image/label pair moves into its place, to be dequeued the next time you call it.
So this function only returns the TensorFlow op for dequeuing elements; you consume it in your training loop.
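A minimal sketch of that consumption pattern in a TF 1.x session (argument values are illustrative):

features, labels = input_fn(data_file, num_epochs=1, shuffle=True, batch_size=32)

with tf.Session() as sess:
    while True:
        try:
            f_batch, l_batch = sess.run([features, labels])  # each run() dequeues one batch
        except tf.errors.OutOfRangeError:
            break  # raised once num_epochs worth of data has been produced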