Getting filename a exampe came from in tf,parse_exampes - tensorflow

I am writing a Data input pipeline in tensorflow that uses a bunch of tfrecord files with different Examples (types).
I am using code like:
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function)
However I want my parse_function to be different for file1.tfrecord than for file2.tfrecord. How do I achieve this. Is there someway of knowin in parse_example which file a particular example came from?

You can use a Dataset.flat_map() transformation to include the filename with each record as follows:
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
filenames = tf.data.from_tensor_slices(filenames)
# `Dataset.flat_map()` creates a nested dataset from each element in `filenames`.
#
# For each file in filename, zip together the filename (repeated infinitely) with
# the records read from that file.
dataset = filenames.flat_map(
lambda fn: tf.data.Dataset.zip((tf.data.Dataset.from_tensors(fn).repeat(None),
tf.data.TFRecordDataset(fn))))
# The _parse_function can now be modified to take both the filename and the record.
dataset = dataset.map(lambda fn, record: _parse_function(fn, record))

Related

Dataset API 'flat_map' method producing error for same code which works with 'map' method

I am trying to create a create a pipeline to read multiple CSV files using TensorFlow Dataset API and Pandas. However, using the flat_map method is producing errors. However, if I am using map method I am able to build the code and run it in session. This is the code I am using. I already opened #17415 issue in TensorFlow Github repository. But apparently, it is not an error and they asked me to post here.
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)
def _get_data_for_dataset(file_name,rows=100):#
print(file_name.decode())
df_input=pd.read_csv(os.path.join(folder_name, file_name.decode()),
usecols =['Wind_MWh','Actual_Load_MWh'],nrows = rows)
X_data = df_input.as_matrix()
X_data.astype('float32', copy=False)
return X_data
dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
[file_name], tf.float64))
dataset= dataset.batch(2)
fiter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map but it doesn't give the output I want. For example, if Pandas is reading N rows from each of my CSV files I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it is giving me (B, N,2) where B is the Batch size. map is adding another axis instead of concatenating on the existing axis. From what I understood in the documentation flat_map is supposed to give a flatted output. In the documentation, both map and flat_map returns type Dataset. So how is my code working with map and not with flat_map?
It would also great if you could point me towards code where Dataset API has been used with Pandas module.
As mikkola points out in the comments, the Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
If you want each row of the array returned by _get_data_for_dataset() to
become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset, using Dataset.from_tensor_slices():
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)
def _get_data_for_dataset(file_name, rows=100):
df_input=pd.read_csv(os.path.join(folder_name, file_name.decode()),
usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
X_data = df_input.as_matrix()
return X_data.astype('float32', copy=False)
dataset = tf.data.Dataset.from_tensor_slices(file_names)
# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()

Live refreshing filenames list

What is the proper way of live updating the file list while keeping the tensorflow session is open?
I am aware of the word once in the tf.train.match_filenames_once() function name but could not find an alternative. Here is a snippet showing what I am trying to do:
import tensorflow as tf
pattern = '/tmp/test_file_*'
filenames = tf.train.match_filenames_once(pattern)
# Create some files
for i in range(3):
with open(f'/tmp/test_file_old_{i}', 'w'):
pass
# Open session and initialize vars
init_op = tf.local_variables_initializer()
sess = tf.Session()
sess.run(init_op)
# Read filenames list before adding new files
filenames_result_1 = sess.run(filenames)
print('First filenames list:')
print(filenames_result_1)
# Create some new files
for i in range(3):
with open(f'/tmp/test_file_new_{i}', 'w'):
pass
# Read filenames list after adding new files
filenames_result_2 = sess.run(filenames)
print('Second filenames list:')
print(filenames_result_2)
# Trying to reinitialize the contents
sess.run(init_op)
filenames_result_3 = sess.run(filenames)
print('Third filenames list:')
print(filenames_result_3)
Output:
First filenames list:
[b'/tmp/test_file_old_1' b'/tmp/test_file_old_0' b'/tmp/test_file_old_2']
Second filenames list:
[b'/tmp/test_file_old_1' b'/tmp/test_file_old_0' b'/tmp/test_file_old_2']
Third filenames list:
[b'/tmp/test_file_old_1' b'/tmp/test_file_old_0' b'/tmp/test_file_old_2']
I would have liked to obtain the up-to-date list of files instead.

Skip Dataset entries in TFRecordDataset.map()

How do I skip entries in a TFRecord file when generating a TFRecordDataset?
Given a TFRecord file and tf.contrib.data.TFRecordDataset object, I create a new dataset by maping over the protobuf definition. For example,
features = {'some_data': tf.FixedLenFeature([], tf.string)}
def parser(example_proto):
e = tf.parse_single_example(example_proto, features)
data = e['some_data']
# ...do a bunch of stuff to data...
return data
x = TFRecordDataset(filename)
x = x.map(parser)
x = x.cache(cache_filename)
x = x.repeat()
x = x.batch(batch_size)
This lets me read in the data and do some preprocessing, then cache the results and batch it up for my model.
My question is, what if I want to skip one of the TFRecord entries (e.g., if the data is invalid/bad)? For example, in parser(), maybe I could return None, or some sort of tf.cond to indicate an invalid entry, or trip some assertion.
(Summarizing the comment as an answer)
The filter() method of Dataset could filter entries according to a predicate.

How to get images filenames from minibatch?

I'm working on this tutorial:
https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb
The test / train data files are simple tab separated text files containing image filenames and correct labels like this:
...\data\CIFAR-10\test\00000.png 3
...\data\CIFAR-10\test\00001.png 8
...\data\CIFAR-10\test\00002.png 8
Assume I create a minibatch like this:
test_minibatch = reader_test.next_minibatch(10)
How can I get to the filenames for the images, which was in the first column of the test data file?
I tried with this code:
orig_features = np.asarray(test_minibatch[features_stream_info].m_data)
print(orig_features)
But, that results in printing the bytes of the images itself.
The file name is lost when loading the images through image reader.
One possible solution is to use a composite reader to load the map file in text format simultaneously. We have composite reader example in here with BrainScript:
https://github.com/Microsoft/CNTK/tree/master/Examples/Image/Regression
With Python, you could do something like:
# read images
image_source = ImageDeserializer(map_file)
image_source.ignore_labels()
image_source.map_features(features_stream_name,
[ImageDeserializer.scale(width=image_width, height=image_height, channels=num_channels,
scale_mode="pad", pad_value=114, interpolations='linear')])
# read rois and labels
roi_source = CTFDeserializer(roi_file)
roi_source.map_input(rois_stream_name, dim=rois_dim, format="dense")
label_source = CTFDeserializer(label_file)
label_source.map_input(labels_stream_name, dim=label_dim, format="dense")
# define a composite reader
rc = ReaderConfig([image_source, roi_source, label_source], epoch_size=sys.maxsize)
return rc.minibatch_source()

How to read data from numpy files in TensorFlow? [duplicate]

I have read the CNN Tutorial on the TensorFlow and I am trying to use the same model for my project.
The problem is now in data reading. I have around 25000 images for training and around 5000 for testing and validation each. The files are in png format and I can read them and convert them into the numpy.ndarray.
The CNN example in the tutorials use a queue to fetch the records from the file list provided. I tried to create my own such binary file by reshaping my images into 1-D array and attaching a label value in the front of it. So my data looks like this
[[1,12,34,24,53,...,105,234,102],
[12,112,43,24,52,...,115,244,98],
....
]
The single row of the above array is of length 22501 size where the first element is the label.
I dumped the file to using pickle and the tried to read from the file using the
tf.FixedLengthRecordReader to read from the file as demonstrated in example
I am doing the same things as given in the cifar10_input.py to read the binary file and putting them into the record object.
Now when I read from the files the labels and the image values are different. I can understand the reason for this to be that pickle dumps the extra information of braces and brackets also in the binary file and they change the fixed length record size.
The above example uses the filenames and pass it to a queue to fetch the files and then the queue to read a single record from the file.
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:
Write out a binary file containing the contents of your numpy array.
import numpy as np
images_and_labels_array = np.array([[...], ...], # [[1,12,34,24,53,...,102],
# [12,112,43,24,52,...,98],
# ...]
dtype=np.uint8)
images_and_labels_array.tofile("/tmp/images.bin")
This file is similar to the format used in CIFAR10 datafiles. You might want to generate multiple files in order to get read parallelism. Note that ndarray.tofile() writes binary data in row-major order with no other metadata; pickling the array will add Python-specific metadata that TensorFlow's parsing routines do not understand.
Write a modified version of read_cifar10() that handles your record format.
def read_my_data(filename_queue):
class ImageRecord(object):
pass
result = ImageRecord()
# Dimensions of the images in the dataset.
label_bytes = 1
# Set the following constants as appropriate.
result.height = IMAGE_HEIGHT
result.width = IMAGE_WIDTH
result.depth = IMAGE_DEPTH
image_bytes = result.height * result.width * result.depth
# Every record consists of a label followed by the image, with a
# fixed number of bytes for each.
record_bytes = label_bytes + image_bytes
assert record_bytes == 22501 # Based on your question.
# Read a record, getting filenames from the filename_queue. No
# header or footer in the binary, so we leave header_bytes
# and footer_bytes at their default of 0.
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
result.key, value = reader.read(filename_queue)
# Convert from a string to a vector of uint8 that is record_bytes long.
record_bytes = tf.decode_raw(value, tf.uint8)
# The first bytes represent the label, which we convert from uint8->int32.
result.label = tf.cast(
tf.slice(record_bytes, [0], [label_bytes]), tf.int32)
# The remaining bytes after the label represent the image, which we reshape
# from [depth * height * width] to [depth, height, width].
depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]),
[result.depth, result.height, result.width])
# Convert from [depth, height, width] to [height, width, depth].
result.uint8image = tf.transpose(depth_major, [1, 2, 0])
return result
Modify distorted_inputs() to use your new dataset:
def distorted_inputs(data_dir, batch_size):
"""[...]"""
filenames = ["/tmp/images.bin"] # Or a list of filenames if you
# generated multiple files in step 1.
for f in filenames:
if not gfile.Exists(f):
raise ValueError('Failed to find file: ' + f)
# Create a queue that produces the filenames to read.
filename_queue = tf.train.string_input_producer(filenames)
# Read examples from files in the filename queue.
read_input = read_my_data(filename_queue)
reshaped_image = tf.cast(read_input.uint8image, tf.float32)
# [...] (Maybe modify other parameters in here depending on your problem.)
This is intended to be a minimal set of steps, given your starting point. It may be more efficient to do the PNG decoding using TensorFlow ops, but that would be a larger change.
In your question, you specifically asked:
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
You can feed the numpy array to a queue directly, but it will be a more invasive change to the cifar10_input.py code than my other answer suggests.
As before, let's assume you have the following array from your question:
import numpy as np
images_and_labels_array = np.array([[...], ...], # [[1,12,34,24,53,...,102],
# [12,112,43,24,52,...,98],
# ...]
dtype=np.uint8)
You can then define a queue that contains the entire data as follows:
q = tf.FIFOQueue([tf.uint8, tf.uint8], shapes=[[], [22500]])
enqueue_op = q.enqueue_many([image_and_labels_array[:, 0], image_and_labels_array[:, 1:]])
...then call sess.run(enqueue_op) to populate the queue.
Another—more efficient—approach would be to feed records to the queue, which you could do from a parallel thread (see this answer for more details on how this would work):
# [With q as defined above.]
label_input = tf.placeholder(tf.uint8, shape=[])
image_input = tf.placeholder(tf.uint8, shape=[22500])
enqueue_single_from_feed_op = q.enqueue([label_input, image_input])
# Then, to enqueue a single example `i` from the array.
sess.run(enqueue_single_from_feed_op,
feed_dict={label_input: image_and_labels_array[i, 0],
image_input: image_and_labels_array[i, 1:]})
Alternatively, to enqueue a batch at a time, which will be more efficient:
label_batch_input = tf.placeholder(tf.uint8, shape=[None])
image_batch_input = tf.placeholder(tf.uint8, shape=[None, 22500])
enqueue_batch_from_feed_op = q.enqueue([label_batch_input, image_batch_input])
# Then, to enqueue a batch examples `i` through `j-1` from the array.
sess.run(enqueue_single_from_feed_op,
feed_dict={label_input: image_and_labels_array[i:j, 0],
image_input: image_and_labels_array[i:j, 1:]})
I want to know if I can pass the numpy array as defined above instead
of the filenames to some reader and it can fetch records one by one
from that array instead of the files.
tf.py_func, that wraps a python function and uses it as a TensorFlow operator, might help. Here's an example.
However, since you've mentioned that your images are stored in png files, I think the simplest solution would be to replace this:
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
result.key, value = reader.read(filename_queue)
with this:
result.key, value = tf.WholeFileReader().read(filename_queue))
value = tf.image.decode_jpeg(value)