tensorflow record with float numpy array - numpy

I want to create tensorflow records to feed my model;
so far I use the following code to store uint8 numpy array to TFRecord format;
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
def convert_to_record(name, image, label, map):
    filename = os.path.join(params.TRAINING_RECORDS_DATA_DIR, name + '.' + params.DATA_EXT)
    writer = tf.python_io.TFRecordWriter(filename)
    image_raw = image.tostring()
    map_raw = map.tostring()
    label_raw = label.tostring()
    example = tf.train.Example(features=tf.train.Features(feature={
        'image_raw': _bytes_feature(image_raw),
        'map_raw': _bytes_feature(map_raw),
        'label_raw': _bytes_feature(label_raw)
    }))
    writer.write(example.SerializeToString())
    writer.close()
which I read with this example code
features = tf.parse_single_example(example, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'map_raw': tf.FixedLenFeature([], tf.string),
    'label_raw': tf.FixedLenFeature([], tf.string),
})
image = tf.decode_raw(features['image_raw'], tf.uint8)
image.set_shape(params.IMAGE_HEIGHT*params.IMAGE_WIDTH*3)
image = tf.reshape(image, (params.IMAGE_HEIGHT,params.IMAGE_WIDTH,3))
map = tf.decode_raw(features['map_raw'], tf.uint8)
map.set_shape(params.MAP_HEIGHT*params.MAP_WIDTH*params.MAP_DEPTH)
map = tf.reshape(map, (params.MAP_HEIGHT,params.MAP_WIDTH,params.MAP_DEPTH))
label = tf.decode_raw(features['label_raw'], tf.uint8)
label.set_shape(params.NUM_CLASSES)
and that's working fine. Now I want to do the same with my array "map" being a float numpy array instead of uint8, and I could not find examples of how to do it.
I tried the function _floats_feature, which works if I pass a scalar to it, but not with arrays.
With uint8, the serialization can be done with the method tostring().
How can I serialize a float numpy array, and how can I read it back?

FloatList and BytesList expect an iterable, so you need to pass a list of floats. Remove the extra brackets in your _floats_feature, i.e.
def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))
numpy_arr = np.ones((3,)).astype(np.float64)
example = tf.train.Example(features=tf.train.Features(feature={"bytes": _floats_feature(numpy_arr)}))
print(example)
features {
  feature {
    key: "bytes"
    value {
      float_list {
        value: 1.0
        value: 1.0
        value: 1.0
      }
    }
  }
}

I will expand on Yaroslav's answer.
Int64List, BytesList and FloatList expect an iterable of the underlying elements (a repeated field). In your case you can use a list.
You mentioned: it works if I pass a scalar to it, but not with arrays. This is expected: when you pass a scalar, your _floats_feature wraps it in a list with one float element, which is exactly what FloatList needs. But when you pass an array, you create a list containing an array and pass it to a function that expects a list of floats.
So just remove the list construction from your function: float_list=tf.train.FloatList(value=value)

I've stumbled across this while working on a similar problem. Since part of the original question was how to read back the float32 feature from tfrecords, I'll leave this here in case it helps anyone:
If map.ravel() was used to pass a map of dimensions [x, y, z] into _floats_feature:
features = {
    ...
    'map': tf.FixedLenFeature([x, y, z], dtype=tf.float32)
    ...
}
parsed_example = tf.parse_single_example(serialized=serialized, features=features)
map = parsed_example['map']
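As a rough end-to-end check, the write and read halves discussed in this thread can be combined in one script. The sketch below assumes TensorFlow 2.x with eager execution, where the parsing functions live under tf.io (the thread itself uses the 1.x names); the feature key "map" and the 2x2x3 shape are illustrative choices, not taken from the question.

```python
import numpy as np
import tensorflow as tf


def floats_feature(values):
    # FloatList takes a flat iterable of floats, so no extra brackets here.
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))


# Illustrative float array standing in for the asker's "map".
arr = np.arange(12, dtype=np.float32).reshape(2, 2, 3)

# Write side: flatten the array before handing it to FloatList.
example = tf.train.Example(features=tf.train.Features(
    feature={"map": floats_feature(arr.ravel())}))
serialized = example.SerializeToString()

# Read side: give FixedLenFeature the original shape to restore it.
parsed = tf.io.parse_single_example(
    serialized,
    {"map": tf.io.FixedLenFeature([2, 2, 3], tf.float32)})
restored = parsed["map"].numpy()
print(restored.shape)
```

The shape given to FixedLenFeature must multiply out to the number of stored floats, otherwise parsing fails.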

Yaroslav's example failed when an n-dimensional array was the input:
numpy_arr = np.ones((3,3)).astype(np.float64)
I found that it worked when I used numpy_arr.ravel() as the input. But is there a better way to do it?
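One alternative to flattening into a FloatList is to keep the bytes-based route from the question and change only the dtype. The NumPy-only sketch below mimics the round trip that a BytesList feature plus tf.decode_raw(..., tf.float32) would perform. The array contents and shape are illustrative, and the shape has to be known (or stored separately) on the reading side. tobytes() is the current name for tostring().

```python
import numpy as np

# Illustrative float array with more than one dimension.
arr = np.arange(12, dtype=np.float32).reshape(2, 2, 3)

# Serialize: raw bytes, 4 bytes per float32 element.
raw = arr.tobytes()

# Deserialize: reinterpret the bytes with the same dtype, then reshape.
restored = np.frombuffer(raw, dtype=np.float32).reshape(arr.shape)

print(len(raw), restored.shape)
```

Decoding with a different dtype than the one used for writing (for example uint8 for float32 data) silently produces wrong values, so the dtype needs to match on both sides.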

First of all, many thanks to Yaroslav and Salvador for their enlightening answers.
In my experience, their methods only work when the input is a 1D NumPy array of shape (n,). When the input is a NumPy array with two or more dimensions, the following error appears:
def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))
numpy_arr = np.arange(12).reshape(2, 2, 3).astype(np.float64)
example = tf.train.Example(features=tf.train.Features(feature={"bytes":
    _float_feature(numpy_arr)}))
print(example)
TypeError: array([[0., 1., 2.],
       [3., 4., 5.]]) has type numpy.ndarray, but expected one of: int, long, float
So I'd like to expand on Tsuan's answer, that is, flatten the input before it is fed into the TF example. The modified code is as follows:
def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))
numpy_arr = np.arange(12).reshape(2, 2, 3).astype(np.float64).flatten()
example = tf.train.Example(features=tf.train.Features(feature={"bytes":
    _floats_feature(numpy_arr)}))
print(example)
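In addition, I find ndarray.flatten() more suitable than np.ravel() here: flatten() always returns a copy, whereas ravel() returns a view of the original array when possible.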
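The difference between the two flattening calls can be checked directly with NumPy. This is a small sketch on an illustrative C-contiguous array; for non-contiguous inputs ravel() may have to copy as well.

```python
import numpy as np

arr = np.arange(12, dtype=np.float64).reshape(2, 2, 3)

flat_copy = arr.flatten()  # always a new array
flat_view = arr.ravel()    # a view here, because arr is C-contiguous

copy_shares = np.shares_memory(arr, flat_copy)
view_shares = np.shares_memory(arr, flat_view)
print(copy_shares, view_shares)

# Writing through the view changes the original array; the copy is unaffected.
flat_view[0] = 100.0
print(arr[0, 0, 0], flat_copy[0])
```

For the purpose of filling a FloatList either call works, since the values are copied into the protobuf message anyway.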

Use tfrmaker, a TFRecord utility package. You can install the package with pip:
pip install tfrmaker
Then you could create tfrecords like this:
from tfrmaker import images
# mapping label names to integer encodings.
LABELS = {"bishop": 0, "knight": 1, "pawn": 2, "queen": 3, "rook": 4}
# specifying data and output directories.
DATA_DIR = "datasets/chess/"
OUTPUT_DIR = "tfrecords/chess/"
# create tfrecords from the images present in the given data directory.
info = images.create(DATA_DIR, LABELS, OUTPUT_DIR)
# info contains a list of information (path: relative path, size: number of images in the tfrecord) about the created tfrecords
print(info)
The package also has some cool features like:
dynamic resizing
splitting tfrecords into optimal shards
splitting tfrecords into training, validation and testing sets
counting the number of images in tfrecords
asynchronous tfrecord creation
NOTE: This package currently supports image datasets that are organised as directories with class names as subdirectory names.

Related

Tensorflow v2.10 mutate output of signature function to be a map of label to results

I'm trying to save my model so that when called from tf-serving the output is:
{
"results": [
{ "label1": x.xxxxx, "label2": x.xxxxx },
{ "label1": x.xxxxx, "label2": x.xxxxx }
]
}
where label1 and label2 are my labels and x.xxxxx are the probability of that label.
This is what I'm trying:
class TFModel(tf.Module):
    def __init__(self, model: tf.keras.Model) -> None:
        self.labels = ['label1', 'label2']
        self.model = model
    @tf.function(input_signature=[tf.TensorSpec(shape=(1, ), dtype=tf.string)])
    def prediction(self, pagetext: str):
        return {
            'results': tf.constant([{k: v for dct in [{self.labels[c]: f"{x:.5f}"} for (c,x) in enumerate(results[i])] for k, v in dct.items()}
                                    for i in range(len(results.numpy()))])}
# and then save it:
tf_model_wrapper = TFModel(classifier_model)
tf.saved_model.save(tf_model_wrapper.model,
                    saved_model_path,
                    signatures={'serving_default':tf_model_wrapper.prediction}
                    )
Side Note: Apparently in TensorFlow v2.0, if signatures is omitted, it should scan the object for the first @tf.function (according to this: https://www.tensorflow.org/api_docs/python/tf/saved_model/save), but in reality that doesn't seem to work. Instead, the model saves successfully with no errors and the @tf.function is not called; the default output is returned instead.
The error I get from the above is:
ValueError: Got a non-Tensor value <tf.Operation 'PartitionedCall' type=PartitionedCall> for key 'output_0' in the output of the function __inference_prediction_125493 used to generate the SavedModel signature 'serving_default'. Outputs for functions used as signatures must be a single Tensor, a sequence of Tensors, or a dictionary from string to Tensor.
I wrapped the result in tf.constant above because of this error, thinking it might be a quick fix, but I think it's me just being naive and not understanding Tensors properly.
I tried a bunch of other things before learning that all outputs must be return values.
How can I change the output to be as I want it to be?
You can think of a Tensor as a multidimensional array, i.e. a structure with a fixed size and number of dimensions whose elements all share the same type. Your return value is a map from a string to a list of dictionaries. A list of dictionaries cannot be converted to a tensor, because there is no guarantee that the number of dimensions and their sizes are constant, nor that all elements share the same type.
You could instead return the raw output of your network, which should be a tensor, and do your post-processing outside of tensorflow-serving.
If you really want to do something like in your question, you can use a Tensor of strings instead, with code like this:
labels = tf.constant(['label1', 'label2'])
# if your batch size is dynamic, you can use tf.shape on your results variable to find it at runtime
batch_size = 32
# assuming your model returns something with the shape (N,2)
results = tf.random.uniform((batch_size,2))
res_as_str = tf.strings.as_string(results, precision=5)
return {
    "results": tf.stack(
        [tf.tile(labels[None, :], [batch_size, 1]), res_as_str], axis=-1
    )
}
The output will be a dictionary mapping the value "results" to a Tensor of dimensions (Batch, number of labels, 2), the last dimension containing the label name and its corresponding value.
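The other option mentioned in this answer, returning the raw tensor and post-processing outside TensorFlow Serving, could look roughly like the following on the client side. This is a plain-Python sketch: the helper name to_label_map and the sample probabilities are hypothetical, and the raw output is assumed to arrive as a nested list with one row per input.

```python
# Hypothetical label names and raw model output (one row per input).
labels = ["label1", "label2"]
raw_output = [[0.12345678, 0.87654322], [0.9, 0.1]]


def to_label_map(rows, label_names, digits=5):
    """Pair each probability with its label, rounded to `digits` places."""
    return {
        "results": [
            {name: round(prob, digits) for name, prob in zip(label_names, row)}
            for row in rows
        ]
    }


response = to_label_map(raw_output, labels)
print(response)
```

Keeping the SavedModel signature purely numeric this way avoids encoding numbers as strings inside the graph.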

Writing and Reading lists to TFRecord example

I want to write a list of integers (or any multidimensional numpy matrix) to one TFRecords example. For both a single value and a list of multiple values I can create the TFRecord file without error. I also know how to read the single value back from the TFRecord file, as shown in the code sample below, which I compiled from various sources.
# Making an example TFRecord
my_example = tf.train.Example(features=tf.train.Features(feature={
    'my_ints': tf.train.Feature(int64_list=tf.train.Int64List(value=[5]))
}))
my_example_str = my_example.SerializeToString()
with tf.python_io.TFRecordWriter('my_example.tfrecords') as writer:
    writer.write(my_example_str)
# Reading it back via a Dataset
featuresDict = {'my_ints': tf.FixedLenFeature([], dtype=tf.int64)}
def parse_tfrecord(example):
    features = tf.parse_single_example(example, featuresDict)
    return features
Dataset = tf.data.TFRecordDataset('my_example.tfrecords')
Dataset = Dataset.map(parse_tfrecord)
iterator = Dataset.make_one_shot_iterator()
with tf.Session() as sess:
    print(sess.run(iterator.get_next()))
But how can I read back a list of values (e.g. [5,6]) from one example? The featuresDict defines the feature as a single int64 value, so parsing fails when the feature holds multiple values, and I get the error below:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Key: my_ints. Can't parse serialized Example.
You can achieve this by using tf.train.SequenceExample. I've edited your code to return both 1D and 2D data. First, you create a list of features which you place in a tf.train.FeatureList. We convert our 2D data to bytes.
vals = [5, 5]
vals_2d = [np.zeros((5,5), dtype=np.uint8), np.ones((5,5), dtype=np.uint8)]
features = [tf.train.Feature(int64_list=tf.train.Int64List(value=[val])) for val in vals]
features_2d = [tf.train.Feature(bytes_list=tf.train.BytesList(value=[val.tostring()])) for val in vals_2d]
featureList = tf.train.FeatureList(feature=features)
featureList_2d = tf.train.FeatureList(feature=features_2d)
To get the correct shape of our 2D feature we need to provide context (non-sequential data). This is done with a context dictionary.
context_dict = {'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[vals_2d[0].shape[0]])),
                'width': tf.train.Feature(int64_list=tf.train.Int64List(value=[vals_2d[0].shape[1]])),
                'length': tf.train.Feature(int64_list=tf.train.Int64List(value=[len(vals_2d)]))}
Then you place each FeatureList in a tf.train.FeatureLists dictionary. Finally, this is placed in a tf.train.SequenceExample along with the context dictionary.
my_example = tf.train.SequenceExample(feature_lists=tf.train.FeatureLists(feature_list={'1D': featureList,
                                                                                        '2D': featureList_2d}),
                                      context=tf.train.Features(feature=context_dict))
my_example_str = my_example.SerializeToString()
with tf.python_io.TFRecordWriter('my_example.tfrecords') as writer:
    writer.write(my_example_str)
To read it back into tensorflow you need to use tf.FixedLenSequenceFeature for the sequential data and tf.FixedLenFeature for the context data. We convert the bytes back to integers and we parse the context data in order to restore the correct shape.
# Reading it back via a Dataset
featuresDict = {'1D': tf.FixedLenSequenceFeature([], dtype=tf.int64),
                '2D': tf.FixedLenSequenceFeature([], dtype=tf.string)}
contextDict = {'height': tf.FixedLenFeature([], dtype=tf.int64),
               'width': tf.FixedLenFeature([], dtype=tf.int64),
               'length': tf.FixedLenFeature([], dtype=tf.int64)}
def parse_tfrecord(example):
    context, features = tf.parse_single_sequence_example(
        example,
        sequence_features=featuresDict,
        context_features=contextDict
    )
    height = context['height']
    width = context['width']
    seq_length = context['length']
    vals = features['1D']
    vals_2d = tf.decode_raw(features['2D'], tf.uint8)
    vals_2d = tf.reshape(vals_2d, [seq_length, height, width])
    return vals, vals_2d
Dataset = tf.data.TFRecordDataset('my_example.tfrecords')
Dataset = Dataset.map(parse_tfrecord)
iterator = Dataset.make_one_shot_iterator()
with tf.Session() as sess:
    print(sess.run(iterator.get_next()))
This will output the sequence of [5, 5] and the 2D numpy arrays. This blog post has a more in depth look at defining sequences with tfrecords https://dmolony3.github.io/Working%20with%20image%20sequences.html
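For a plain list of integers, a regular tf.train.Example may already be enough, provided the parsing spec declares more than a scalar. The sketch below assumes TensorFlow 2.x with eager execution and the tf.io names (the thread uses the 1.x equivalents). It reads the same serialized example once with a fixed length of 2 and once with a variable-length spec.

```python
import tensorflow as tf

# One Example holding two integers under the key from the question.
example = tf.train.Example(features=tf.train.Features(feature={
    "my_ints": tf.train.Feature(int64_list=tf.train.Int64List(value=[5, 6]))
}))
serialized = example.SerializeToString()

# Fixed length: the shape [2] must match the number of stored values.
fixed = tf.io.parse_single_example(
    serialized, {"my_ints": tf.io.FixedLenFeature([2], tf.int64)})
fixed_values = fixed["my_ints"].numpy().tolist()

# Variable length: parsed as a SparseTensor, converted to dense afterwards.
variable = tf.io.parse_single_example(
    serialized, {"my_ints": tf.io.VarLenFeature(tf.int64)})
variable_values = tf.sparse.to_dense(variable["my_ints"]).numpy().tolist()

print(fixed_values, variable_values)
```

SequenceExample, as used above, remains the better fit when each record carries a sequence of per-step features plus separate context.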

Using tf.py_func with pickle files in Dataset API

I am trying to use the Dataset API with my dataset, which consists of pickle files. Each file contains my data, which is a vector of floats, and the labels, which are a one-hot vector.
I have tried using tf.py_func to load the features, but I am unable to do so because of mismatching shapes. Since these pickle files include the label as well, I cannot pass it directly in the tuple as in the example here. So I am a bit lost on how to continue.
This is my code so far:
path = "my_dir_to_pkl_files"
pkl_files = glob.glob((path+"*.pkl"))
dataset = tf.data.Dataset.from_tensor_slices((pkl_files))
dataset = dataset.map(
    lambda filename: tuple(tf.py_func(
        load_features, [filename], [tf.float32])))
And here is my python function to read the features.
def load_features(name):
    decoded = name.decode("UTF-8")
    if os.path.exists(decoded):
        with open(decoded, 'rb') as f:
            file = pickle.load(f)
        return file['features']
        # I have commented the line below but this should return
        # the features and the label in a one hot vector
        # return file['features'], file['targets']
    else:
        print("Something went wrong!")
        exit(-1)
I would expect the Dataset API to return a tuple with N features and one one-hot vector for each sample in my batch. Instead I'm getting
InvalidArgumentError: pyfunc_0 returns 30 values, but expects to see 1 values.
Any suggestions? Thanks.
Edit:
Here is what my pickle file looks like. The features array has a shape of [30,100]. I attach the same file as well here.
{'features': array([[0.64864044, 0.71419346, 0.35874235, ..., 0.66058507, 0.89013242,
0.67564707],
[0.15958826, 0.38115951, 0.46636267, ..., 0.49682084, 0.08863887,
0.17142761],
[0.26925915, 0.27901399, 0.91624607, ..., 0.30269212, 0.47494327,
0.43265325],
...,
[0.50405357, 0.7441127 , 0.04308265, ..., 0.06766902, 0.87449393,
0.31018099],
[0.44777562, 0.30836258, 0.48148097, ..., 0.74899213, 0.97264324,
0.43391464],
[0.50583501, 0.56803691, 0.61290449, ..., 0.8350931 , 0.52897295,
0.23731264]]), 'targets': array([0, 0, 1, 0])}
The error appears when I try to get an element from the dataset:
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
print(sess.run(next_element))

How to read (decode) tfrecords with tf.data API

I have a custom dataset, that I then stored as tfrecord, doing
# toy example data
label = np.asarray([[1,2,3],
                    [4,5,6]]).reshape(2, 3, -1)
sample = np.stack((label + 200).reshape(2, 3, -1))
def bytes_feature(values):
    """Returns a TF-Feature of bytes.
    Args:
      values: A string.
    Returns:
      A TF-Feature.
    """
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[values]))
def labeled_image_to_tfexample(sample_binary_string, label_binary_string):
    return tf.train.Example(features=tf.train.Features(feature={
        'sample/image': bytes_feature(sample_binary_string),
        'sample/label': bytes_feature(label_binary_string)
    }))
def _write_to_tf_record():
    with tf.Graph().as_default():
        image_placeholder = tf.placeholder(dtype=tf.uint16)
        encoded_image = tf.image.encode_png(image_placeholder)
        label_placeholder = tf.placeholder(dtype=tf.uint16)
        encoded_label = tf.image.encode_png(image_placeholder)
        with tf.python_io.TFRecordWriter("./toy.tfrecord") as writer:
            with tf.Session() as sess:
                feed_dict = {image_placeholder: sample,
                             label_placeholder: label}
                # Encode image and label as binary strings to be written to tf_record
                image_string, label_string = sess.run(fetches=(encoded_image, encoded_label),
                                                      feed_dict=feed_dict)
                # Define structure of what is going to be written
                file_structure = labeled_image_to_tfexample(image_string, label_string)
                writer.write(file_structure.SerializeToString())
    return
However I cannot read it. First I tried (based on http://www.machinelearninguru.com/deep_learning/tensorflow/basics/tfrecord/tfrecord.html , https://medium.com/coinmonks/storage-efficient-tfrecord-for-images-6dc322b81db4 and https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564)
def read_tfrecord_low_level():
    data_path = "./toy.tfrecord"
    filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)
    reader = tf.TFRecordReader()
    _, raw_records = reader.read(filename_queue)
    decode_protocol = {
        'sample/image': tf.FixedLenFeature((), tf.int64),
        'sample/label': tf.FixedLenFeature((), tf.int64)
    }
    enc_example = tf.parse_single_example(raw_records, features=decode_protocol)
    recovered_image = enc_example["sample/image"]
    recovered_label = enc_example["sample/label"]
    return recovered_image, recovered_label
I also tried variations that cast enc_example and decode it, such as in "Unable to read from Tensorflow tfrecord file". However, when I try to evaluate them, my Python session just freezes and gives no output or traceback.
Then I tried using eager execution to see what is happening, but apparently it is only compatible with tf.data API. However as far as I understand transformations on tf.data API are made on the whole dataset. https://www.tensorflow.org/api_guides/python/reading_data mentions that a decode function must be written, but doesn't give an example on how to do that. All the tutorials I have found are made for TFRecordReader (which doesn't work for me).
Any help (pinpointing what I am doing wrong/ explaining what is happening/ indications on how to decode tfrecords with tf.data API) is highly appreciated.
According to https://www.youtube.com/watch?v=4oNdaQk0Qv4 and https://www.youtube.com/watch?v=uIcqeP7MFH0, tf.data is the best way to create input pipelines, so I am very interested in learning it that way.
Thanks in advance!
I am not sure why storing the encoded png causes the evaluation to not work, but here is a possible way of working around the problem. Since you mentioned that you would like to use the tf.data way of creating input pipelines, I'll show how to use it with your toy example:
label = np.asarray([[1,2,3],
                    [4,5,6]]).reshape(2, 3, -1)
sample = np.stack((label + 200).reshape(2, 3, -1))
First, the data has to be saved to the TFRecord file. The difference from what you did is that the image is not encoded to png.
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
writer = tf.python_io.TFRecordWriter("toy.tfrecord")
example = tf.train.Example(features=tf.train.Features(feature={
    'label_raw': _bytes_feature(tf.compat.as_bytes(label.tostring())),
    'sample_raw': _bytes_feature(tf.compat.as_bytes(sample.tostring()))}))
writer.write(example.SerializeToString())
writer.close()
What happens in the code above is that the arrays are turned into strings (1d objects) and then stored as bytes features.
Then, to read the data back using the tf.data.TFRecordDataset and tf.data.Iterator class:
filename = 'toy.tfrecord'
# Create a placeholder that will contain the name of the TFRecord file to use
data_path = tf.placeholder(dtype=tf.string, name="tfrecord_file")
# Create the dataset from the TFRecord file
dataset = tf.data.TFRecordDataset(data_path)
# Use the map function to read every sample from the TFRecord file (_read_from_tfrecord is shown below)
dataset = dataset.map(_read_from_tfrecord)
# Create an iterator object that enables you to access all the samples in the dataset
iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
label_tf, sample_tf = iterator.get_next()
# Similarly to tf.Variables, the iterators have to be initialised
iterator_init = iterator.make_initializer(dataset, name="dataset_init")
with tf.Session() as sess:
    # Initialise the iterator passing the name of the TFRecord file to the placeholder
    sess.run(iterator_init, feed_dict={data_path: filename})
    # Obtain the images and labels back
    read_label, read_sample = sess.run([label_tf, sample_tf])
The function _read_from_tfrecord() is:
def _read_from_tfrecord(example_proto):
    feature = {
        'label_raw': tf.FixedLenFeature([], tf.string),
        'sample_raw': tf.FixedLenFeature([], tf.string)
    }
    features = tf.parse_example([example_proto], features=feature)
    # Since the arrays were stored as strings, they are now 1d
    label_1d = tf.decode_raw(features['label_raw'], tf.int64)
    sample_1d = tf.decode_raw(features['sample_raw'], tf.int64)
    # In order to make the arrays in their original shape, they have to be reshaped.
    label_restored = tf.reshape(label_1d, tf.stack([2, 3, -1]))
    sample_restored = tf.reshape(sample_1d, tf.stack([2, 3, -1]))
    return label_restored, sample_restored
Instead of hard-coding the shape [2, 3, -1], you could also store it in the TFRecord file, but for simplicity I didn't do that.
I made a little gist with a working example.
Hope this helps!
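Storing the shape next to the raw bytes, as hinted at above, might look like the following. This is a sketch assuming TensorFlow 2.x with eager execution and the tf.io names; the feature keys "label_raw" and "label_shape" are illustrative, and the serialized example is kept in memory instead of being written to a file.

```python
import numpy as np
import tensorflow as tf

# Toy label array from the question, as int64 with shape (2, 3, 1).
label = np.asarray([[1, 2, 3], [4, 5, 6]], dtype=np.int64).reshape(2, 3, -1)

example = tf.train.Example(features=tf.train.Features(feature={
    # Raw bytes of the array.
    "label_raw": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[label.tobytes()])),
    # The shape, stored as three integers alongside the data.
    "label_shape": tf.train.Feature(
        int64_list=tf.train.Int64List(value=label.shape)),
}))


def parse(serialized):
    parsed = tf.io.parse_single_example(serialized, {
        "label_raw": tf.io.FixedLenFeature([], tf.string),
        "label_shape": tf.io.FixedLenFeature([3], tf.int64),
    })
    flat = tf.io.decode_raw(parsed["label_raw"], tf.int64)
    # Reshape using the stored shape instead of a hard-coded one.
    return tf.reshape(flat, parsed["label_shape"])


dataset = tf.data.Dataset.from_tensors(example.SerializeToString()).map(parse)
restored = next(iter(dataset)).numpy()
print(restored.shape)
```

The rank (three entries in "label_shape") is still fixed in the parsing spec; only the individual dimension sizes become data-driven.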

TensorFlow - how to import data with multiple labels

I'm trying to create a model in TensorFlow which predicts the ideal item for a user by predicting a vector of numbers.
I have created a dataset in Spark and saved it as a TFRecord using Spark TensorFlow connector.
In the dataset, I have several hundreds of features and 20 labels in each row. For easier manipulation, I have given every column a prefix 'feature_' or 'label_'.
Now I'm trying to write input function for TensorFlow, but I can't figure out how to parse the data.
So far I have written this:
def dataset_input_fn():
    path = ['data.tfrecord']
    dataset = tf.data.TFRecordDataset(path)
    def parser(record):
        example = tf.train.Example()
        example.ParseFromString(record)
        # TODO: no idea what to do here
        # features = parsed["features"]
        # label = parsed["label"]
        # return features, label
    dataset = dataset.map(parser)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(32)
    dataset = dataset.repeat(100)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels
How can I split the Example into a feature set and a label set? I have tried to split the Example into two parts, but there is no way to even access it. The only way I have managed to access it is by printing the example out, which gives me something like this.
features {
  ...
  feature {
    key: "feature_wishlist_hour"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "label_emb_1"
    value {
      float_list {
        value: 0.4
      }
    }
  }
  feature {
    key: "label_emb_2"
    value {
      float_list {
        value: 0.8
      }
    }
  }
  ...
}
Your parser function should mirror how you constructed the example proto. In your case it should be something similar to:
# example proto decode
def parser(example_proto):
    keys_to_features = {'feature_wishlist_hour': tf.FixedLenFeature((), tf.int64),
                        'label_emb_1': tf.FixedLenFeature((), tf.float32),
                        'label_emb_2': tf.FixedLenFeature((), tf.float32)}
    parsed_features = tf.parse_single_example(example_proto, keys_to_features)
    return parsed_features['feature_wishlist_hour'], (parsed_features['label_emb_1'], parsed_features['label_emb_2'])
EDIT: From the comments it seems you are encoding each of the features as key, value pair, which is not right. Check this answer: Numpy to TFrecords: Is there a more simple way to handle batch inputs from tfrecords? on how to write it in a proper way.
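With several hundred 'feature_' columns and 20 'label_' columns, listing every key by hand in the return statement gets unwieldy. One possible variation is to parse everything with a single spec and then split the parsed dictionary by key prefix. The sketch below assumes TensorFlow 2.x with eager execution and the tf.io names, and it uses only the three keys visible in the question; the full spec would have to be built from the real column list.

```python
import tensorflow as tf

# Parsing spec for the three columns shown in the question (illustrative subset).
feature_spec = {
    "feature_wishlist_hour": tf.io.FixedLenFeature([], tf.int64),
    "label_emb_1": tf.io.FixedLenFeature([], tf.float32),
    "label_emb_2": tf.io.FixedLenFeature([], tf.float32),
}


def parser(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    # Split the parsed dict by key prefix.
    features = {k: v for k, v in parsed.items() if k.startswith("feature_")}
    label_keys = sorted(k for k in parsed if k.startswith("label_"))
    # Stack the scalar labels into one vector in a stable key order.
    labels = tf.stack([parsed[k] for k in label_keys])
    return features, labels


# Build a matching example in memory to try the parser out.
example = tf.train.Example(features=tf.train.Features(feature={
    "feature_wishlist_hour": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[0])),
    "label_emb_1": tf.train.Feature(
        float_list=tf.train.FloatList(value=[0.4])),
    "label_emb_2": tf.train.Feature(
        float_list=tf.train.FloatList(value=[0.8])),
}))
features, labels = parser(example.SerializeToString())
print(sorted(features), labels.shape)
```

Note that sorted() orders keys as strings, so 'label_emb_10' would come before 'label_emb_2'; an explicit ordered list of label keys avoids that.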