The neural network I am currently working on is accepting a sparse tensor as input. I am reading my data from a TFRecord as follows:
_, examples = tf.TFRecordReader(options=options).read_up_to(
filename_queue, num_records=batch_size)
features = tf.parse_example(examples, features={
'input_feat': tf.SparseFeature(index_key='input_feat_idx',
value_key='input_feat_values',
dtype=tf.int64,
size=SIZE_FEATURE)})
It works like a charm but I was looking at the tf.data API which looks more convenient for a lot of tasks and I am not sure how to read tf.SparseTensor objects like I do with the tf.RecordReader and tf.parse_example(). Any idea?
TensorFlow 1.5 will add native support for tf.SparseTensor in the core transformations. (This is currently available if you pip install tf-nightly, or build from source on the master branch of TensorFlow.) This means that you can write your pipeline as the following:
# Create a dataset of string records from the input files.
dataset = tf.data.TFRecordReader(filenames)
# Convert each string record into a `tf.SparseTensor` representing a single example.
dataset = dataset.map(lambda record: tf.parse_single_example(
record, features={'input_feat': tf.SparseFeature(index_key='input_feat_idx',
value_key='input_feat_values',
dtype=tf.int64,
size=SIZE_FEATURE)})
# Stack together up to `batch_size` consecutive elements into a `tf.SparseTensor`
# representing a batch of examples.
dataset = dataset.batch(batch_size)
# Create an iterator to access the elements of `dataset` sequentially.
iterator = dataset.make_one_shot_iterator()
# `next_element` is a `tf.SparseTensor`.
next_element = iterator.get_next()
Related
Edit:
To clarify why this question is different from the suggested duplicates, this SO question follows up on those suggested duplicates, on what exactly is Keras doing with the techniques described in those SO questions. The suggested duplicates specify using a dataset API make_one_shot_iterator() in model.fit, my follow up is that make_one_shot_iterator() can only go through the dataset once, however in the solutions given, several epochs are specified.
This is a follow up to these SO questions
How to Properly Combine TensorFlow's Dataset API and Keras?
Tensorflow keras with tf dataset input
Using tf.data.Dataset as training input to Keras model NOT working
Where "Starting from Tensorflow 1.9, one can pass tf.data.Dataset object directly into keras.Model.fit() and it would act similar to fit_generator". Each example has a TF dataset one shot iterator fed into Kera's model.fit.
An example is given below
# Load mnist training data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
training_set = tfdata_generator(x_train, y_train,is_training=True)
model = # your keras model here
model.fit(
training_set.make_one_shot_iterator(),
steps_per_epoch=len(x_train) // 128,
epochs=5,
verbose = 1)
However, according the the Tensorflow Dataset API guide (here https://www.tensorflow.org/guide/datasets ) :
A one-shot iterator is the simplest form of iterator, which only
supports iterating once through a dataset
So it's only good for 1 epoch. However, the codes in the SO questions specify several epochs, with the code example above specifying 5 epochs.
Is there any explanation for this contradiction? Does Keras somehow know that when the one shot iterator has gone through the dataset, it can re-initialize and shuffle the data?
You can simply pass dataset object to model.fit, Keras will handle iteration.
Considering one of pre-made datasets:
train, test = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train[0], train[1]))
This will create dataset object from training data of cifar10 dataset. In this case parse function isn't needed.
If you create dataset from path containing images of list of numpy arrays you'll need one.
dataset = tf.data.Dataset.from_tensor_slices((image_path, labels_path))
In case you'll need a function to load actual data from filename. Numpy array can be handled the same way just without tf.read_file
def parse_func(filename):
f = tf.read_file(filename)
image = tf.image.decode_image(f)
label = #get label from filename
return image, label
Then you can shuffle, batch, and map any parse function to this dataset. You can control how many examples will be preloaded with shuffle buffer. Repeat controls epoch count and better be left None, so it will repeat indefinitely. You can use either plain batch function or combine with
dataset = dataset.shuffle().repeat()
dataset.apply(tf.data.experimental.map_and_batch(map_func=parse_func, batch_size,num_parallel_batches))
Then dataset object can be passed to model.fit
model.fit(dataset, epochs, steps_per_epoch). Note that steps_per_epoch is a necessary parameter in this case, it will define when to start new epoch. So you'll have to know epoch size in advance.
I want to train a model in tf.keras of Tensorflow 2.0 with data that is bigger than my ram, but the tutorials only show examples with predefined datasets.
I followed this tutorial:
Load Images with tf.data, I could not make this work for data on numpy arrays or tfrecords.
This is an example with array being transformed into tensorflow datasets. What I want is to make this work for multiple numpy array files or multiple tfrecords files.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
# Shuffle and slice the dataset.
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
# Since the dataset already takes care of batching,
# we don't pass a `batch_size` argument.
model.fit(train_dataset, epochs=3)
If you have tfrecords files:
path = ['file1.tfrecords', 'file2.tfrecords', ..., 'fileN.tfrecords']
dataset = tf.data.Dataset.list_files(path, shuffle=True).repeat()
dataset = dataset.interleave(lambda filename: tf.data.TFRecordDataset(filename), cycle_length=len(path))
dataset = dataset.map(parse_function).batch()
parse_function handles decoding and any kind of augmentation.
In case with numpy arrays, you can construct dataset either from a list of filenames or from list of arrays. Labels are just a list. Or they could be taken from file while parsing single example.
path = #list of numpy arrays
or
path = os.listdir(path_to files)
dataset = tf.data.Dataset.from_tensor_slices((path, labels))
dataset = dataset.map(parse_function).batch()
parse_function handles decoding:
def parse_function(filename, label): #Both filename and label will be passed if you provided both to from_tensor_slices
f = tf.read_file(filename)
image = tf.image.decode_image(f))
image = tf.reshape(image, [H, W, C])
label = label #or it could be extracted from, for example, filename, or from file itself
#do any augmentations here
return image, label
To decode .npy files, the best way is to use reshape without read_file or decode_raw, but first load numpys with np.load:
paths = [np.load(i) for i in ["x1.npy", "x2.npy"]]
image = tf.reshape(filename, [2])
or try using decode_raw
f = tf.io.read_file(filename)
image = tf.io.decode_raw(f, tf.float32)
Then just pass batched dataset to model.fit(dataset). TensorFlow 2.0 allows simple iteration over dataset. No need to use iterator. Even in later versions of 1.x API you could just pass dataset to .fit method
for example in dataset:
func(example)
I want to feed my custom data deep network using "tfrecords" in tensorflow. There are several ways to do this; using coordinator or iterator. I was mixed up with this and tried several times using book's and blog's guide. But unfortunately, none of them did work for me.
Roughly speaking assume that I have tfrecords file and a model
How can we complete the following code to feed the training process with tfrecords?
for tf.Session() as sess:
# getting the images and labels
...
feed_dict = {x:images, y:labels}
loss = sess.run([optimization], feed_dict=feed_dict)
I looked at similar questions online, but those didn't answer what I am looking for.
Here is the official link:
https://www.tensorflow.org/api_guides/python/reading_data
You read TFRecord files with a tensorflow Dataset object or a TFRecordReader. I think this article explains it well (both writing and reading TFRecord files), using TFRecordReader (I prefer Dataset, myself).
Using dataset, your pipeline is something like this:
import tensorflow as tf
filenames = [...]
def _parse_fn(x):
feature_set = { 'image': tf.FixedLenFeature([], tf.string),
'label': tf.FixedLenFeature([], tf.int64)}
parsed = tf.parse_single_example(x, features= feature_set )
return parsed['image'], parsed['label']
dataset = (
tf.data
.TFRecordDataset(filenames)
.map(_parse_fn)
.shuffle(...)
.batch(...)
# etc.
)
iterator = dataset.make_one_shot_iterator()
next_item = iterator.get_next() # This is a nested structure of Tensors
image, label = next_item # Since our parse_fn returned a list of 2
... # now you build your model here, from the 'image' Tensor
# rather than from a placeholder as you do when using feed_dict
with tf.Session() as sess:
print sess.run([optimize])
The main thing to notice is that the output you will get from the Dataset or TFRecordReader is a Tensor, not a Numpy array object. Therefore, you will not use feed_dict, but rather build your model from the input tensor (as in, your model with extend directly from the image Tensor, rather than a Placeholder).
I created a dataset in TFRecord format for testing. Every entry contains 200 columns, named C1 - C199, each being a strings list, and a label column to denote the labels. The code to create the data can be found here: https://github.com/codescv/tf-dist/blob/8bb3c44f55939fc66b3727a730c57887113e899c/src/gen_data.py#L25
Then I used a linear model to train the data. The first approach looks like this:
dataset = tf.data.TFRecordDataset(data_file)
dataset = dataset.prefetch(buffer_size=batch_size*10)
dataset = dataset.map(parse_tfrecord, num_parallel_calls=5)
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)
features, labels = dataset.make_one_shot_iterator().get_next()
logits = tf.feature_column.linear_model(features=features, feature_columns=columns, cols_to_vars=cols_to_vars)
train_op = ...
with tf.Session() as sess:
sess.run(train_op)
The full code can be found here: https://github.com/codescv/tf-dist/blob/master/src/lr_single.py
When I run the code above, I get 0.85 steps/sec (batch size being 1024).
In the second approach, I manually get batches from Dataset into python, then feed them to a placeholder, like this:
example = tf.placeholder(dtype=tf.string, shape=[None])
features = tf.parse_example(example, features=tf.feature_column.make_parse_example_spec(columns+[tf.feature_column.numeric_column('label', dtype=tf.float32, default_value=0)]))
labels = features.pop('label')
train_op = ...
dataset = tf.data.TFRecordDataset(data_file).repeat().batch(batch_size)
next_batch = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
data_batch = sess.run(next_batch)
sess.run(train_op, feed_dict={example: data_batch})
The full code can be found here: https://github.com/codescv/tf-dist/blob/master/src/lr_single_feed.py
When I run the code above, I get 5 steps/sec. That is 5x faster than the first approach. This is what I do not understand, because theoretically the second should be slower due to the extra serialization/deserialization of data batches.
Thanks!
There is currently (as of TensorFlow 1.9) a performance issue when using tf.data to map and batch tensors that have a large number of features with a small amount of data in each. The issue has two causes:
The dataset.map(parse_tfrecord, ...) transformation will execute O(batch_size * num_columns) small operations to create a batch. By contrast, feeding a tf.placeholder() to tf.parse_example() will execute O(1) operations to create the same batch.
Batching many tf.SparseTensor objects using dataset.batch() is much slower than directly creating the same tf.SparseTensor as the output of tf.parse_example().
Improvements to both these issues are underway, and should be available in a future version of TensorFlow. In the meantime, you can improve the performance of the tf.data-based pipeline by switching the order of the dataset.map() and dataset.batch() and rewriting the dataset.map() to work on a vector of strings, like the feeding based version:
dataset = tf.data.TFRecordDataset(data_file)
dataset = dataset.prefetch(buffer_size=batch_size*10)
dataset = dataset.repeat(num_epochs)
# Batch first to create a vector of strings as input to the map().
dataset = dataset.batch(batch_size)
def parse_tfrecord_batch(record_batch):
features = tf.parse_example(
record_batch,
features=tf.feature_column.make_parse_example_spec(
columns + [
tf.feature_column.numeric_column(
'label', dtype=tf.float32, default_value=0)]))
labels = features.pop('label')
return features, labels
# NOTE: Parallelism might not be as useful, because the individual map function now does
# more work per invocation, but you might want to experiment with this.
dataset = dataset.map(parse_tfrecord_batch)
# Add a prefetch at the end to pipeline execution.
dataset = dataset.prefetch(1)
features, labels = dataset.make_one_shot_iterator().get_next()
# ...
EDIT (2018/6/18): To answer your questions from the comments:
Why is dataset.map(parse_tfrecord, ...) O(batch_size * num_columns), not O(batch_size)? If parsing requires enumeration of the columns, why doesn't parse_example take O(num_columns)?
When you wrap TensorFlow code in a Dataset.map() (or other functional transformation) a constant number of extra operations per output are added to "return" values from the function and (in the case of tf.SparseTensor values) "convert" them to a standard format. When you directly pass the outputs of tf.parse_example() to the input of your model, these operations aren't added. While they are very small operations, executing so many of them can become a bottleneck. (Technically the parsing does take O(batch_size * num_columns) time, but the constants involved in parsing are much smaller than executing an operation.)
Why do you add a prefetch at the end of the pipeline?
When you're interested in performance, this is almost always the best thing to do, and it should improve the overall performance of your pipeline. For more information about best practices, see the performance guide for tf.data.
I am learning TensorFlow (TF), and its been just one day, so I apologize in advance if my doubt is too basic to ask.
I was studying the linear classification example on the official TF website.
The authors defined a function called input_fun to read the data. The function is as follows:
def input_fn(data_file, num_epochs, shuffle, batch_size):
"""Generate an input function for the Estimator."""
assert tf.gfile.Exists(data_file), (
'%s not found. Please make sure you have either run data_download.py or '
'set both arguments --train_data and --test_data.' % data_file)
def parse_csv(value):
print('Parsing', data_file)
columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
features = dict(zip(_CSV_COLUMNS, columns))
labels = features.pop('income_bracket')
return features, tf.equal(labels, '>50K')
# Extract lines from input files using the Dataset API.
dataset = tf.data.TextLineDataset(data_file)
if shuffle:
dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])
dataset = dataset.map(parse_csv, num_parallel_calls=5)
# We call repeat after shuffling, rather than before, to prevent separate
# epochs from blending together.
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
features, labels = iterator.get_next()
return features, labels
I am not able to understand the second last line. The one-shot-iterator calls get_next() only once but shouldn't it iterate on the data multiple times (i.e. number of rows times) to extract the rows, like this example here?
So here, get_next() is basically a dequeue op. The data is in a queue, when you consume (use/call) the element called by get_next(), it is removed from the queue, and the next image/labels is moved in its place, which is dequeued next time you call it.
So currently, this function only returns the tensorflow op for dequeing elements, you can consume it in your training loop.