TFRecords for embedded text data - tensorflow

For a project at Uni, I'm working on the implementation of a Question Answering (bAbI dataset Task 5 at the moment, see https://research.fb.com/downloads/babi/) system with Neural Nets in TensorFlow, and I want to use TFRecords for my Input Pipeline.
My idea is that one Example in TFRecords terms should consist of the context for the question, the question itself, the answer, and the supporting sentence number (int which points to the most important sentence in the context to be able to answer the question). Here is how I've defined the function:
def make_example(context, question, answer, support):
    ex = tf.train.SequenceExample()
    fl_context = ex.feature_lists.feature_list["context"]
    fl_question = ex.feature_lists.feature_list["question"]
    fl_answer = ex.feature_lists.feature_list["answer"]
    # The supporting sentence number is a single value, so it is stored in the context
    ex.context.feature["support"].int64_list.value.append(support)
    for token in context:
        fl_context.feature.add().int64_list.value.append(token)
    for qWord in question:
        fl_question.feature.add().int64_list.value.append(qWord)
    for ansWord in answer:
        fl_answer.feature.add().int64_list.value.append(ansWord)
    return ex
However, before passing the context, question, and answer, I want to embed the words and represent them by their GloVe vectors, i.e. by a (m,d) matrix, where m is the number of tokens in the sentence, and d is the number of dimensions each word vector has. This seems not to be handled well by my make_example function as I get:
TypeError: (array([[ -9.58490000e-01, 1.73210000e-01,
2.51650000e-01,
-5.61450000e-01, -1.21440000e-01, 1.54350000e+00,
-1.28930000e+00, -9.77790000e-01, -1.35480000e-01,
-6.06930000e-01, -1.37810000e+00, 6.33470000e-01,
1.33160000e-01, 2.46320000e-01, 6.60260000e-01,
-4.46130000e-02, 4.09510000e-01, -7.61670000e-01,
4.67530000e-01, -6.67810000e-01, 2.99850000e-01,
-2.74810000e-01, -5.47990000e-01, -8.56820000e-01,
5.30880000e-02, -2.01700000e+00, 7.48530000e-01,
-1.27830000e-01, 1.32050000e-01, -2.19450000e-01,
2.29830000e+00, -3.17680000e-01, -8.64940000e-01,
-1.08630000e-01, -8.13770000e-02, -7.03420000e-01,
4.60000000e-01, -3.34730000e-01, 4.37030000e-02,
-7.55080000e-01, -6.89710000e-01, 7.14380000e-01,
-8.35950000e-02, 1.58620000e-02, -5.23850000e-01,
1.72520000e-01, -4.98740000e-01, 2.30810000e-01,
-3.64690000e-01, 1.5 has type <class 'tuple'>, but expected one of:
(<class 'int'>,)
This points to the fl_context.feature.add().int64_list.value.append(token) line above. Could someone point out where I've misunderstood the concept of TFRecords, and give me advice on how to approach the problem?
I've searched a lot for learning materials, but usually the examples on TFRecords are with image data. So far my references are https://medium.com/#TalPerry/getting-text-into-tensorflow-with-the-dataset-api-ffb832c8bec6 and http://web.stanford.edu/class/cs20si/lectures/notes_09.pdf .
Thanks a lot in advance!

The solution to my question can be found here: https://github.com/simonada/q-and-a-tensorflow/blob/master/src/Q%26A%20with%20TF-%20TFRecords%20and%20Eager%20Execution.ipynb
My approach is as follows:
Store the texts into a csv file: per row (context, question, answer)
Define a function to convert sequence to tf_example, in my case
def sequence_to_tf_example(context, question, answer):
    context_ids = vectorize(context, False, word_to_index)
    question_ids = vectorize(question, False, word_to_index)
    answer_ids = vectorize(answer, True, word_to_index)
    ex = tf.train.SequenceExample()
    context_tokens = ex.feature_lists.feature_list["context"]
    question_tokens = ex.feature_lists.feature_list["question"]
    answer_tokens = ex.feature_lists.feature_list["answer"]
    for token in context_ids:
        context_tokens.feature.add().int64_list.value.append(token)
    for token in question_ids:
        question_tokens.feature.add().int64_list.value.append(token)
    for token in answer_ids:
        #print(token)
        answer_tokens.feature.add().int64_list.value.append(token)
    return ex
Define write functions
def write_example_to_tfrecord(context, question, answer, tfrecord_file, writer):
    example = sequence_to_tf_example(context, question, answer)
    writer.write(example.SerializeToString())
def write_data_to_tf_record(filename):
    file_csv = filename + '.csv'
    file_tfrecords = filename + '.tfrecords'
    with open(file_csv) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)  # skip header
        writer = tf.python_io.TFRecordWriter(file_tfrecords)
        for row in readCSV:
            write_example_to_tfrecord(row[0], row[1], row[2], file_tfrecords, writer)
        writer.close()
Define read functions
def read_from_tfrecord(ex):
    sequence_features = {
        "context": tf.FixedLenSequenceFeature([], dtype=tf.int64),
        "question": tf.FixedLenSequenceFeature([], dtype=tf.int64),
        "answer": tf.FixedLenSequenceFeature([], dtype=tf.int64)
    }
    # Parse the example (returns a dictionary of tensors)
    _, sequence_parsed = tf.parse_single_sequence_example(
        serialized=ex,
        sequence_features=sequence_features
    )
    return {"context": sequence_parsed['context'], "question": sequence_parsed['question'],
            "answer": sequence_parsed['answer']}
Create dataset
def make_dataset(path, batch_size=128):
    '''
    Makes a Tensorflow dataset that is shuffled, batched and parsed.
    '''
    # Read a tf record file. This makes a dataset of raw TFRecords
    dataset = tf.data.TFRecordDataset([path])
    # Apply/map the parse function to every record. Now the dataset is a bunch of dictionaries of Tensors
    dataset = dataset.map(read_from_tfrecord)
    # Shuffle the dataset
    dataset = dataset.shuffle(buffer_size=10000)
    # Specify padding for each tensor separately
    dataset = dataset.padded_batch(batch_size, padded_shapes={
        "context": tf.TensorShape([None]),
        "question": tf.TensorShape([None]),
        "answer": tf.TensorShape([None])
    })
    return dataset
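The approach above stores word ids and sidesteps the original TypeError. If the embedded (m, d) matrices themselves are to be stored, the error goes away once each d-dimensional vector is written to a float_list instead of an int64_list, since int64_list only accepts integers. The following is a minimal sketch under that assumption; the helper name make_embedded_example and the toy 3x4 matrix are made up for illustration and are not part of the linked notebook.

```python
import numpy as np
import tensorflow as tf

def make_embedded_example(context_vectors, support):
    """Hypothetical helper: one float_list Feature per token vector."""
    ex = tf.train.SequenceExample()
    ex.context.feature["support"].int64_list.value.append(int(support))
    fl_context = ex.feature_lists.feature_list["context"]
    for vector in context_vectors:
        # extend() adds all d components of one token to a single Feature
        fl_context.feature.add().float_list.value.extend(
            np.asarray(vector, dtype=np.float32).tolist())
    return ex

# Toy stand-in for a GloVe-embedded sentence: m=3 tokens, d=4 dimensions
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)
ex = make_embedded_example(vectors, support=1)
steps = ex.feature_lists.feature_list["context"].feature
```

On the reading side, such a feature would presumably be declared as tf.FixedLenSequenceFeature([d], dtype=tf.float32) with d known in advance, so that each parsed context has shape (m, d).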

Related

How to read (decode) tfrecords with tf.data API

I have a custom dataset, that I then stored as tfrecord, doing
# toy example data
label = np.asarray([[1,2,3],
                    [4,5,6]]).reshape(2, 3, -1)
sample = np.stack((label + 200).reshape(2, 3, -1))
def bytes_feature(values):
    """Returns a TF-Feature of bytes.
    Args:
      values: A string.
    Returns:
      A TF-Feature.
    """
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[values]))
def labeled_image_to_tfexample(sample_binary_string, label_binary_string):
    return tf.train.Example(features=tf.train.Features(feature={
        'sample/image': bytes_feature(sample_binary_string),
        'sample/label': bytes_feature(label_binary_string)
    }))
def _write_to_tf_record():
    with tf.Graph().as_default():
        image_placeholder = tf.placeholder(dtype=tf.uint16)
        encoded_image = tf.image.encode_png(image_placeholder)
        label_placeholder = tf.placeholder(dtype=tf.uint16)
        # Encode the label placeholder, not the image placeholder
        encoded_label = tf.image.encode_png(label_placeholder)
        with tf.python_io.TFRecordWriter("./toy.tfrecord") as writer:
            with tf.Session() as sess:
                feed_dict = {image_placeholder: sample,
                             label_placeholder: label}
                # Encode image and label as binary strings to be written to tf_record
                image_string, label_string = sess.run(fetches=(encoded_image, encoded_label),
                                                      feed_dict=feed_dict)
                # Define structure of what is going to be written
                file_structure = labeled_image_to_tfexample(image_string, label_string)
                writer.write(file_structure.SerializeToString())
    return
However I cannot read it. First I tried (based on http://www.machinelearninguru.com/deep_learning/tensorflow/basics/tfrecord/tfrecord.html , https://medium.com/coinmonks/storage-efficient-tfrecord-for-images-6dc322b81db4 and https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564)
def read_tfrecord_low_level():
    data_path = "./toy.tfrecord"
    filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)
    reader = tf.TFRecordReader()
    _, raw_records = reader.read(filename_queue)
    decode_protocol = {
        'sample/image': tf.FixedLenFeature((), tf.int64),
        'sample/label': tf.FixedLenFeature((), tf.int64)
    }
    enc_example = tf.parse_single_example(raw_records, features=decode_protocol)
    recovered_image = enc_example["sample/image"]
    recovered_label = enc_example["sample/label"]
    return recovered_image, recovered_label
I also tried variations that cast enc_example and decode it, such as in "Unable to read from Tensorflow tfrecord file". However, when I try to evaluate them, my Python session just freezes and gives no output or traceback.
Then I tried using eager execution to see what is happening, but apparently it is only compatible with tf.data API. However as far as I understand transformations on tf.data API are made on the whole dataset. https://www.tensorflow.org/api_guides/python/reading_data mentions that a decode function must be written, but doesn't give an example on how to do that. All the tutorials I have found are made for TFRecordReader (which doesn't work for me).
Any help (pinpointing what I am doing wrong/ explaining what is happening/ indications on how to decode tfrecords with tf.data API) is highly appreciated.
According to https://www.youtube.com/watch?v=4oNdaQk0Qv4 and https://www.youtube.com/watch?v=uIcqeP7MFH0 tf.data is the best way to create input pipelines, so I am very interested in learning that way.
Thanks in advance!
I am not sure why storing the encoded png causes the evaluation to not work, but here is a possible way of working around the problem. Since you mentioned that you would like to use the tf.data way of creating input pipelines, I'll show how to use it with your toy example:
label = np.asarray([[1,2,3],
[4,5,6]]).reshape(2, 3, -1)
sample = np.stack((label + 200).reshape(2, 3, -1))
First, the data has to be saved to the TFRecord file. The difference from what you did is that the image is not encoded to png.
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
writer = tf.python_io.TFRecordWriter("toy.tfrecord")
# tobytes() serializes each array to raw bytes (tostring() is a deprecated alias)
example = tf.train.Example(features=tf.train.Features(feature={
    'label_raw': _bytes_feature(tf.compat.as_bytes(label.tobytes())),
    'sample_raw': _bytes_feature(tf.compat.as_bytes(sample.tobytes()))}))
writer.write(example.SerializeToString())
writer.close()
What happens in the code above is that the arrays are turned into strings (1d objects) and then stored as bytes features.
Then, to read the data back using the tf.data.TFRecordDataset and tf.data.Iterator classes:
filename = 'toy.tfrecord'
# Create a placeholder that will contain the name of the TFRecord file to use
data_path = tf.placeholder(dtype=tf.string, name="tfrecord_file")
# Create the dataset from the TFRecord file
dataset = tf.data.TFRecordDataset(data_path)
# Use the map function to read every sample from the TFRecord file (_read_from_tfrecord is shown below)
dataset = dataset.map(_read_from_tfrecord)
# Create an iterator object that enables you to access all the samples in the dataset
iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
label_tf, sample_tf = iterator.get_next()
# Similarly to tf.Variables, the iterators have to be initialised
iterator_init = iterator.make_initializer(dataset, name="dataset_init")
with tf.Session() as sess:
    # Initialise the iterator passing the name of the TFRecord file to the placeholder
    sess.run(iterator_init, feed_dict={data_path: filename})
    # Obtain the images and labels back
    read_label, read_sample = sess.run([label_tf, sample_tf])
The function _read_from_tfrecord() is:
def _read_from_tfrecord(example_proto):
    feature = {
        'label_raw': tf.FixedLenFeature([], tf.string),
        'sample_raw': tf.FixedLenFeature([], tf.string)
    }
    features = tf.parse_example([example_proto], features=feature)
    # Since the arrays were stored as strings, they are now 1d
    label_1d = tf.decode_raw(features['label_raw'], tf.int64)
    sample_1d = tf.decode_raw(features['sample_raw'], tf.int64)
    # In order to make the arrays in their original shape, they have to be reshaped.
    label_restored = tf.reshape(label_1d, tf.stack([2, 3, -1]))
    sample_restored = tf.reshape(sample_1d, tf.stack([2, 3, -1]))
    return label_restored, sample_restored
Instead of hard-coding the shape [2, 3, -1], you could also store the shape in the TFRecord file, but for simplicity I didn't do it.
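The bytes round trip and the idea of keeping the shape next to the data can be checked with NumPy alone, without touching TensorFlow. This is only a sketch of the logic: the variable names raw and shape are made up here, and in a TFRecord the shape would presumably go into an extra int64_list feature alongside the bytes feature.

```python
import numpy as np

label = np.asarray([[1, 2, 3],
                    [4, 5, 6]], dtype=np.int64).reshape(2, 3, -1)

# What would be written: the raw bytes plus the shape as a list of ints
raw = label.tobytes()
shape = list(label.shape)

# What would be done after parsing: decode with the SAME dtype, then reshape
restored = np.frombuffer(raw, dtype=np.int64).reshape(shape)
```

The dtype used for decoding has to match the dtype used for writing; decoding int64 bytes as another integer width would give a different number of elements and the reshape would fail.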
I made a little gist with a working example.
Hope this helps!

How to perform data augmentation in Tensorflow Estimator's input_fn

Using Tensorflow's Estimator API, at what point in the pipeline should I perform the data augmentation?
According to this official Tensorflow guide, one place to perform the data augmentation is in the input_fn:
def parse_fn(example):
    "Parse TFExample records and perform simple data augmentation."
    example_fmt = {
        "image": tf.FixedLenFeature((), tf.string, ""),
        "label": tf.FixedLenFeature((), tf.int64, -1)
    }
    parsed = tf.parse_single_example(example, example_fmt)
    image = tf.image.decode_image(parsed["image"])
    # augments image using slice, reshape, resize_bilinear
    #         |
    #         |
    #         |
    #         v
    image = _augment_helper(image)
    return image, parsed["label"]
def input_fn():
    files = tf.data.Dataset.list_files("/path/to/dataset/train-*.tfrecord")
    dataset = files.interleave(tf.data.TFRecordDataset)
    dataset = dataset.map(map_func=parse_fn)
    # ...
    return dataset
My question
If I perform data augmentation inside input_fn, does parse_fn return a single example or a batch including the original input image + all of the augmented variants? If it should only return a single [augmented] example, how do I ensure that all images in the dataset are used in its un-augmented form, as well as all variants?
If you use iterators on your dataset, your _augment_helper function will be called on every iteration over the dataset, for each block of data fed in (because you are calling parse_fn in dataset.map).
Change your code to
ds_iter = dataset.make_one_shot_iterator()
ds_iter = ds_iter.get_next()
return ds_iter
I've tested this with a simple augmentation function
def _augment_helper(image):
    print(image.shape)
    image = tf.image.random_brightness(image, 255.0, 1)
    image = tf.clip_by_value(image, 0.0, 255.0)
    return image
Change 255.0 to whatever the maximum value is in your dataset; I used 255.0 because my example's dataset had 8-bit pixel values.
It will return a single example for every call made to parse_fn; if you then use the .batch() operation, it will return a batch of parsed images.
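The second half of the question (seeing every image both un-augmented and augmented) is not covered above. One option, given here only as a library-independent sketch, is to keep the original elements and append an augmented copy of each. The function names and the brightness range are hypothetical; in tf.data the same idea would presumably be written as dataset.concatenate(dataset.map(...)).

```python
import random

def augment(pixels, rng):
    # Hypothetical stand-in for _augment_helper:
    # random brightness shift, clipped to the 8-bit range [0, 255]
    delta = rng.uniform(-50.0, 50.0)
    return [min(255.0, max(0.0, p + delta)) for p in pixels]

def originals_and_augmented(images, seed=0):
    rng = random.Random(seed)
    # First pass: every image unchanged
    for image in images:
        yield list(image)
    # Second pass: one augmented variant per image
    for image in images:
        yield augment(image, rng)

images = [[0.0, 128.0, 255.0], [10.0, 20.0, 30.0]]
combined = list(originals_and_augmented(images))
```

Because the augmentation is random, mapping it over a repeated dataset also yields a different variant in every epoch, which is often enough without storing the variants explicitly.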

tensorflow: Reading time series data from TFRecord

I'm using a SequenceExample protobuf to read/write time-series data into a TFRecord file.
I serialized a pair of NumPy arrays as follows:
writer = tf.python_io.TFRecordWriter(file_name)
context = tf.train.Features( ... Feature( ... ) ... )
feature_data = tf.train.FeatureList(feature=[
    tf.train.Feature(float_list=tf.train.FloatList(
        value=np.random.normal(size=(4065000,))))])
labels = tf.train.FeatureList(feature=[
    tf.train.Feature(int64_list=tf.train.Int64List(
        value=np.random.random_integers(0, 10, size=(1084,))))])
## feature_data and labels are of similar, but varying lengths
feature_list = {"feature_data": feature_data,
                "labels": labels}
feature_lists = tf.train.FeatureLists(feature_list=feature_list)
example = tf.train.SequenceExample(context=context,
                                   feature_lists=feature_lists)
## serialize and close
When trying to read the .tfrecords file, I've gotten quite a few errors, primarily because the SequenceExample protobuf writes the time series data as a series of values (e.g. value: -12.2549, value: -18.1372, .... value:13.1234). My code to read the .tfrecords file is as follows:
dataset = tf.data.TFRecordDataset("data/tf_record.tfrecords")
dataset = dataset.map(decode)
dataset = dataset.make_one_shot_iterator().get_next()
### reshape tensors and feed to estimator###
My decode() function is defined as follows:
def decode(serialized_proto):
    context_features = {...}
    sequence_features = {"feature_data": tf.FixedLenSequenceFeature((None,),
                                                                    tf.float32),
                         "labels": tf.FixedLenSequenceFeature((None,),
                                                              tf.int64)}
    context, sequence = tf.parse_single_sequence_example(serialized_proto,
                                                         context_features=context_features,
                                                         sequence_features=sequence_features)
    return context, sequence
One of the errors is as follows:
Shape [?] is not fully defined for 'ParseSingleSequenceExample/ParseSingleSequenceExample' (op: 'ParseSingleSequenceExample') with input shapes: [], [0], [], [], [], [], [], [], [].
My primary question is how to think about the structure of Datasets. I'm not sure I really understand the structure of the data returned. I'm having a hard time iterating through this Dataset and returning the variably-sized Tensors. Thanks in advance!
You can only use tf.FixedLenSequenceFeature when the shape of the feature is known. Otherwise, use tf.VarLenFeature instead.
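Another way to satisfy the known-shape requirement, sketched here on the assumption that each time step holds a single scalar, is to write one Feature per step instead of one Feature holding the whole array. The helper name make_sequence_example and the tiny array sizes are made up for illustration.

```python
import numpy as np
import tensorflow as tf

def make_sequence_example(values, labels):
    """Hypothetical helper: one scalar Feature per time step."""
    ex = tf.train.SequenceExample()
    fl_values = ex.feature_lists.feature_list["feature_data"]
    fl_labels = ex.feature_lists.feature_list["labels"]
    for v in values:
        fl_values.feature.add().float_list.value.append(float(v))
    for l in labels:
        fl_labels.feature.add().int64_list.value.append(int(l))
    return ex

example = make_sequence_example(np.random.normal(size=(5,)),
                                np.random.randint(0, 11, size=(3,)))
# Round trip through the serialized form to check nothing is lost
restored = tf.train.SequenceExample.FromString(example.SerializeToString())
```

With this layout each step has the fully defined shape [], so the features can be declared as tf.FixedLenSequenceFeature([], tf.float32) and tf.FixedLenSequenceFeature([], tf.int64), and the two sequences may still have different lengths. For series with millions of steps the per-step overhead may be noticeable, so this is a trade-off rather than a recommendation.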

TensorFlow - how to import data with multiple labels

I'm trying to create a model in TensorFlow which predicts ideal item for a user by predicting a vector of numbers.
I have created a dataset in Spark and saved it as a TFRecord using Spark TensorFlow connector.
In the dataset, I have several hundreds of features and 20 labels in each row. For easier manipulation, I have given every column a prefix 'feature_' or 'label_'.
Now I'm trying to write an input function for TensorFlow, but I can't figure out how to parse the data.
So far I have written this:
def dataset_input_fn():
    path = ['data.tfrecord']
    dataset = tf.data.TFRecordDataset(path)
    def parser(record):
        example = tf.train.Example()
        example.ParseFromString(record)
        # TODO: no idea what to do here
        # features = parsed["features"]
        # label = parsed["label"]
        # return features, label
    dataset = dataset.map(parser)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(32)
    dataset = dataset.repeat(100)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels
How can I split the Example into a feature set and a label set? I have tried to split the Example into two parts, but there is no way to even access it. The only way I have managed to access it is by printing the example out, which gives me something like this.
features {
  ...
  feature {
    key: "feature_wishlist_hour"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "label_emb_1"
    value {
      float_list {
        value: 0.4
      }
    }
  }
  feature {
    key: "label_emb_2"
    value {
      float_list {
        value: 0.8
      }
    }
  }
  ...
}
Your parser function should mirror how you constructed the example proto. In your case it should be something similar to:
# example proto decode
def parser(example_proto):
    keys_to_features = {'feature_wishlist_hour': tf.FixedLenFeature((), tf.int64),
                        'label_emb_1': tf.FixedLenFeature((), tf.float32),
                        'label_emb_2': tf.FixedLenFeature((), tf.float32)}
    parsed_features = tf.parse_single_example(example_proto, keys_to_features)
    return parsed_features['feature_wishlist_hour'], (parsed_features['label_emb_1'], parsed_features['label_emb_2'])
EDIT: From the comments, it seems you are encoding each of the features as a key-value pair, which is not right. Check this answer on how to write it properly: "Numpy to TFrecords: Is there a more simple way to handle batch inputs from tfrecords?"
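With several hundred columns, listing every key by hand in the return statement does not scale. Since the question says every column carries a 'feature_' or 'label_' prefix, the parsed dictionary can be split by prefix. The sketch below uses a plain dict with made-up values to show the logic; inside the real parser the dictionary would be the output of tf.parse_single_example, and the helper name is hypothetical.

```python
def split_features_and_labels(parsed):
    """Hypothetical helper: split a parsed dict by column-name prefix."""
    features = {k: v for k, v in parsed.items() if k.startswith("feature_")}
    labels = {k: v for k, v in parsed.items() if k.startswith("label_")}
    return features, labels

# Stand-in for the dictionary returned by the parsing call
parsed = {"feature_wishlist_hour": 0, "label_emb_1": 0.4, "label_emb_2": 0.8}
features, labels = split_features_and_labels(parsed)
```

The keys_to_features specification could be generated from the list of column names in the same way, assuming the dtype of each column is known.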

TFRecords: Write list of tensors to single Example

I'm extracting features from images using a convolutional neural network. The network in question has three outputs (three output tensors), which differ in size. I want to store the extracted features in TFRecords, one Example for each image:
Example:
image_id: 1
features/fc8: [output1.1, output1.2, output1.3]
Example:
image_id: 2
features/fc8: [output2.1, output2.2, output2.3]
....
How can I achieve this structure using TFRecords?
EDIT: An elegant way is to use tf.train.SequenceExample.
Convert the data to the tf.train.SequenceExample format:
def make_example(features, image_id):
    ex = tf.train.SequenceExample()
    ex.context.feature['image_id'].int64_list.value.append(image_id)
    fl_features = ex.feature_lists.feature_list['features/fc8']
    for feature in features:
        # Serialize each output array to raw bytes
        fl_features.feature.add().bytes_list.value.append(feature.tobytes())
    return ex
Writing to TFRecord
def _convert_to_tfrecord(output_file, feature_batch, ids_batch):
    writer = tf.python_io.TFRecordWriter(output_file)
    for features, image_id in zip(feature_batch, ids_batch):
        ex = make_example(features, image_id)
        writer.write(ex.SerializeToString())
    writer.close()
Parsing example
def parse_example_proto(example_serialized):
    context_features = {
        'image_id': tf.FixedLenFeature([], dtype=tf.int64)}
    sequence_features = {
        'features/fc8': tf.FixedLenSequenceFeature([], dtype=tf.string)}
    context_parsed, sequence_parsed = tf.parse_single_sequence_example(
        serialized=example_serialized,
        context_features=context_features,
        sequence_features=sequence_features)
    # Return the parsed tensors, not the feature specification
    return context_parsed['image_id'], sequence_parsed['features/fc8']
Note: The features here are saved in a bytes_list; you can also save them in a float_list.
Another way is to use tf.parse_single_example() by storing the examples as:
image_id: 1
features/fc8_1: output1.1
features/fc8_2: output1.2
features/fc8_3: output1.3
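The flat layout above could be produced along the following lines. This is a sketch under assumptions: the helper name make_flat_example is made up, each output is flattened and stored as a float_list, and the toy arrays only stand in for the three network outputs.

```python
import numpy as np
import tensorflow as tf

def make_flat_example(features, image_id):
    """Hypothetical helper: one numbered key per output tensor."""
    feature = {
        "image_id": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(image_id)]))
    }
    for i, output in enumerate(features, start=1):
        # Flatten each output, since a float_list is one-dimensional
        values = np.asarray(output, dtype=np.float32).ravel().tolist()
        feature["features/fc8_%d" % i] = tf.train.Feature(
            float_list=tf.train.FloatList(value=values))
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Three outputs of different sizes for one image
outputs = [np.zeros(2), np.ones((2, 2)), np.arange(3)]
flat = make_flat_example(outputs, image_id=1)
keys = sorted(flat.features.feature.keys())
```

Since the outputs are flattened, their original shapes would have to be known or stored separately in order to restore them after parsing.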