Applying Tensorflow Dataset .map() to subsequent dataset elements

I've got a TFRecordDataset and I'm trying to preprocess the features of two subsequent elements by means of the map() API.
dataset_ext = dataset.map(lambda x: tf.py_function(parse_data, [x], [tf.float32]))
As map applies the function parse_data to every dataset element, I don't know what parse_data should look like in order to keep track of the feature extracted from the previous dataset element.
Can anyone help? Thank you
EDIT: I'm working on the Waymo dataset, so each element is a frame. You can refer to https://github.com/Jossome/Waymo-open-dataset-document for its structure.
This is my parse function parse_data:
from waymo_open_dataset import dataset_pb2 as open_dataset

def parse_data(input_data):
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(input_data.numpy()))
    av_speed = (frame.images[0].velocity.v_x, frame.images[0].velocity.v_y, frame.images[0].velocity.v_z)
    return av_speed
I'd like to build a dataset whose features are the car speed and acceleration, defined as the speed variation between subsequent frames (the first value can be 0).
One way I thought about is to give the map function dataset and dataset.skip(1) as inputs but I'm not sure about it yet.

I am not sure, but it might be unnecessary to make your mapped function a tf.py_function. What parse_data should look like depends on what your dataset contains. If, for example, each element holds two file paths (one instance of input data and one instance of output data), the mapping function should take two arguments and return two values.
For example, if your dataset contains images and you want them to be randomly cropped each time an example is drawn from the dataset, the mapping function looks like this:
def process_img_random_crop(img_in, img_out, output_shape):
    merged = tf.stack([img_in, img_out])
    mergedCrop = tf.image.random_crop(merged, size=(2,) + output_shape)
    img_in_cropped, img_out_cropped = tf.unstack(mergedCrop, 2, 0)
    return img_in_cropped, img_out_cropped
I call it as follows:
image_ds_test = image_ds_test.map(lambda i, o: process_img_random_crop(i, o, output_shape=(64, 64, 1)), num_parallel_calls=tf.data.experimental.AUTOTUNE)
What exactly is your plan with dataset_ext and what does it contain?
Edit:
Okay, I see what you meant with the two frames. The map function is applied to each entry of your dataset separately. If you need cross-entry information, a single entry of your dataset needs to contain two frames. With this more complicated set-up, I would suggest using a tensorflow Sequence: the explanation from the tensorflow team is pretty straightforward. Hope this helps!
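The idea raised in the question, pairing the dataset with a shifted copy of itself, can be sketched as follows. This is a minimal sketch, not the answerer's method: the `speeds` dataset is a hypothetical stand-in for the output of `dataset.map(...)` with `parse_data`, and "acceleration" is taken to be the plain difference between consecutive speeds, with no division by the frame interval.

```python
import tensorflow as tf

# Hypothetical stand-in for the parsed per-frame speeds (v_x, v_y, v_z).
speeds = tf.data.Dataset.from_tensor_slices(
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 1.0, 0.0], [6.0, 1.0, 0.0]]
)

# Prepend a copy of the first element, so frame 0 is paired with itself
# and ends up with zero acceleration.
previous = speeds.take(1).concatenate(speeds)

# zip stops at the shorter dataset, so the extra last element of
# `previous` is ignored.
pairs = tf.data.Dataset.zip((previous, speeds))

# Each element becomes (speed, speed - previous_speed).
features = pairs.map(lambda prev, curr: (curr, curr - prev))

for speed, accel in features.as_numpy_iterator():
    print(speed, accel)
```

This only gives meaningful differences if the dataset is not shuffled before the pairing, since the pairs rely on the original frame order.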

Related

Tensorflow: How to shuffle a dataset so that it doesn't reshuffle after splitting

I am so confused as to why it's been so hard for me to find the answer to this. I want to be able to shuffle a dataset one time. After shuffling, I then split the dataset into train/val/test splits. I can't find a way to do this without the train/val/test data being all reshuffled together anytime I iterate over the split datasets.
I guess this is because the train/val/test datasets all point to locations in a dataset that is being reshuffled each time.
Here's an example of my code that is trying to do this.
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.shuffle(buffer_size=len(x))
train, val, test = split_tf_dataset(dataset, len(x), test_pct=0.1, val_pct=0.1)
train, val, test = train.batch(batch_size=50, drop_remainder=True), val.batch(batch_size=50, drop_remainder=True), test.batch(batch_size=50, drop_remainder=True)
'split_tf_dataset' is just performing take and skip operations, no randomness added there.
My workaround so far has been to shuffle the data before I create the Dataset, but does Dataset have this functionality that I'm missing? The option 'reshuffle_each_iteration' doesn't seem to do anything in this case.
I would expect setting reshuffle_each_iteration to False to fix this problem, however it seems to have no effect. I've also tried calling Dataset.sample_from_datasets, however with one dataset it only bounces your input back to you, doing nothing.
This is the numpy code that does what I'm expecting tensorflow should be able to do:
x = x[np.random.choice(np.arange(0, len(x)), size=len(x), replace=False)]
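The workaround described above (shuffle once in NumPy, then split) might look like the sketch below. It assumes `x` and `y` are NumPy arrays of equal length; the toy data, the seed and the 10% split sizes are illustrative. Note that `np.random.choice` samples with replacement unless `replace=False` is passed, so a permutation is the safer way to get a true shuffle.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy data: row i of x belongs to label y[i].
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Draw one permutation and apply it to both arrays so pairs stay aligned.
perm = rng.permutation(len(x))
x_shuf, y_shuf = x[perm], y[perm]

# Split once; the slices are fixed from here on.
n_test = int(0.1 * len(x))
n_val = int(0.1 * len(x))
test_x, test_y = x_shuf[:n_test], y_shuf[:n_test]
val_x, val_y = x_shuf[n_test:n_test + n_val], y_shuf[n_test:n_test + n_val]
train_x, train_y = x_shuf[n_test + n_val:], y_shuf[n_test + n_val:]
```

Each split can then be passed to `tf.data.Dataset.from_tensor_slices` on its own, and only the training dataset needs a further `shuffle` call.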

How do I get and use value from a tensor within a TF 2.0 Dataset map step?

I'm using TensorFlow Alpha 2.0.
I have TFRecords files I'm reading from, each one holding a short video clip with each frame encoded as jpeg byte string to save space:
example_schema = {
    'numframes': tf.io.FixedLenFeature([], tf.int64),
    'frames': tf.io.VarLenFeature(tf.string)
}
I have a map step in my tf.data.Dataset pipeline that successfully parses each example:
def parse_tfrecord(p):
    return tf.io.parse_single_example(p, example_schema)
My next step is to read out the number of frames from numframes and run the tf.io.decode_jpeg function on each frame in frames.values[i] with i being from range(numframes):
def parse_jpegs(p):
    numframes = p['numframes']
    return tf.map_fn(tf.io.decode_jpeg, [p['frames'].values[i] for i in range(numframes)])
My dataset pipeline for completeness:
def dataset():
    dataset = tf.data.Dataset.list_files("*.tfrecord")
    dataset = tf.data.TFRecordDataset(dataset)
    dataset = dataset.shuffle(1000).repeat()
    dataset = dataset.map(parse_tfrecord)
    dataset = dataset.map(parse_jpegs)
    return dataset
If I exclude the dataset.map(parse_jpegs) line it all works alright, showing me something like {'frames': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x7f394c285518>, 'numframes': <tf.Tensor: id=2937, shape=(), dtype=int64, numpy=25>}
(Note that the numframes tensor includes a numpy value of 25. I can get that outside my dataset pipeline with the tensor.numpy() method)
Within that map function though, I can't call .numpy() to get the value out of the tensor, and when printing the tensor itself it hasn't been evaluated or something because there is no value shown yet.
What is the best way to parse all these frames within the dataset pipeline?
EDIT: Error message I'm getting is TypeError: 'Tensor' object cannot be interpreted as an integer in parse_jpegs when trying to get numframes. This makes sense to me why a tensor can't be interpreted as an int, but how can I get the value from that tensor to use to set the range?
The problem I'm running into comes down to the fact that each "frames" object has a different number of frames. If I can apply tf.io.decode_jpeg to each frame in that list without needing to record number of frames separately I would be fine with that, but I have "numframes" here so I know how many frames need to be decoded in my "frames" list.
EDIT: I'll leave the question up for anyone else who might find it helpful, but I ended up just returning the raw bytestrings and doing the decode_jpeg in a separate generator function outside the dataset API. It was much easier that way, even if it might be slower.
In my specific case, I found out that map_fn was trying to produce an output tensor of the same type as my input tensor. Here, tf.io.decode_jpeg takes in a string (of bytes) and outputs a uint8 array, which was causing problems. Passing the output type explicitly, tf.map_fn(..., dtype=tf.uint8) (fn_output_signature=tf.uint8 in TensorFlow 2.3 and later), seems to have fixed it for me! Maybe not exactly as written, since I continued tinkering with it after asking the question, but I got it working now.
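A minimal sketch of that fix is shown below. It assumes TensorFlow 2.3 or later (where the argument is `fn_output_signature`; earlier 2.x releases use `dtype`) and assumes that all frames of a clip have the same height and width, so the decoded frames can be stacked. The helper name `decode_frames` and the toy clip are illustrative. Because `tf.map_fn` iterates over the first dimension of its input, the Python `range(numframes)` loop from the question is not needed.

```python
import tensorflow as tf

def decode_frames(jpeg_strings):
    """Decode a 1-D tensor of JPEG byte strings into a uint8 tensor
    of shape [frames, height, width, channels]."""
    # The input dtype is string but the output is uint8, so the output
    # type has to be stated explicitly.
    return tf.map_fn(tf.io.decode_jpeg, jpeg_strings,
                     fn_output_signature=tf.uint8)

# Illustrative input: a "clip" of three identical 4x4 RGB frames.
frame = tf.io.encode_jpeg(tf.zeros([4, 4, 3], dtype=tf.uint8))
clip = tf.stack([frame, frame, frame])

decoded = decode_frames(clip)
```

Inside the pipeline from the question, the same helper could be applied as `decode_frames(p['frames'].values)` in the second map step.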

When working with batches via the Dataset API in Tensorflow what is the recommended way to perform index lookups in dictionary?

I am currently going about refactoring existing code over to the newer TF Dataset API. In our current process we populate a standard python dictionary with product ids to classification ids.
Now I have moved our images/paths over to a TF Dataset, and I use tf.string_split to extract various pieces of information from the filename itself, one of them being the product_id. At this point the product_id is a tf tensor, so I can no longer perform the lookup the way we did before ("if product_id in products_to_class"), because a tensor can't be used to search a standard dictionary.
I am using this project as a way to learn how to increase performance, so I wanted to know what the "best/recommended" approach is when working with tf Dataset API batches. Do I convert the product_id to a string and just perform the lookup via the current if check above, or do I convert the products_to_class dictionary to another data structure, such as another Dataset, and perform the lookup using tensors throughout? Any advice would be greatly appreciated.
Small example of what I have currently is:
products_to_class = {'12345': 0, '67890': 1}
# The logic below is in a mapped function used on a tf.data.Dataset
def _parse_fn(filename, label):
    core_file = tf.string_split([filename], '\\').values[-1]
    product_id = tf.string_split([core_file], ".").values[0]
    # Unable to perform the check below because product_id is now a tensor and
    # products_to_class is a Python dictionary
    if product_id in products_to_class:
        label = products_to_class[product_id]
The built-in TensorFlow mechanism for doing this is to use a tf.contrib.lookup table. For example, if you have a list of string keys that you want to map to dense integers, you can define the following outside your _parse_fn():
# This constructor creates a lookup table that implicitly maps each string in the
# argument to its index in the list (e.g. '67890' -> 1).
products_to_class = tf.contrib.lookup.index_table_from_tensor(['12345', '67890'])
...and then use products_to_class.lookup() in your _parse_fn().
def _parse_fn(filename, label):
    core_file = tf.string_split([filename], '\\').values[-1]
    product_id = tf.string_split([core_file], ".").values[0]
    # Returns a `tf.Tensor` that corresponds to the value associated with
    # `product_id` in the `products_to_class` table.
    label = products_to_class.lookup(product_id)
    # ...
Note that this places two additional constraints on your program:
You must use Dataset.make_initializable_iterator() instead of Dataset.make_one_shot_iterator().
You must call sess.run(tf.tables_initializer()) before starting to consume elements from the input pipeline.
Both of these will be handled for you if you use the high-level tf.estimator API and return the tf.data.Dataset from your input_fn.
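The answer above targets TensorFlow 1.x; tf.contrib was removed in TensorFlow 2.0. For readers on 2.x, the same idea can be sketched with `tf.lookup.StaticHashTable`, which needs no explicit initializer call in eager mode. This is an illustrative sketch rather than the answerer's code: the file names, the single-argument `_parse_fn` and the default value of -1 for unknown product ids are assumptions.

```python
import tensorflow as tf

# Map product-id strings to class ids; unknown ids map to -1.
keys = tf.constant(["12345", "67890"])
values = tf.constant([0, 1], dtype=tf.int64)
products_to_class = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, values), default_value=-1
)

def _parse_fn(filename):
    # Take the last path component, then the part before the first dot.
    core_file = tf.strings.split(filename, "\\")[-1]
    product_id = tf.strings.split(core_file, ".")[0]
    # The lookup runs on tensors, so it works inside Dataset.map.
    return products_to_class.lookup(product_id)

filenames = tf.data.Dataset.from_tensor_slices(
    ["imgs\\12345.jpg", "imgs\\67890.jpg", "imgs\\00000.jpg"]
)
labels = [int(label) for label in filenames.map(_parse_fn)]
print(labels)
```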

Mxnet Gluon custom data iterator

I have written a custom data iterator using mx.io.DataIter class. What's the easiest way to use this data iterator with Gluon interface?
I went through the documentation and couldn't find an easy way to do so. One of my ideas was to use it as an iterator and get the data and label from each batch as follows.
for e in range(epochs):
    train_iter.reset()
    for batch_data in train_iter:
        data = nd.concatenate(([d for d in batch_data.data]))
        label = nd.concatenate(([l for l in batch_data.label]))
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(batch_size)
        print(nd.mean(loss).asscalar())
But this may not be optimal as I need to concatenate per batch.
What's the optimal way to achieve this? I.e., is there a systematic way to write a simple custom iterator for Gluon?
How do I add context information in the above cases?
I think your approach works. Basically you can get data from batch_data.data and label from batch_data.label and feed them into the network.
I'm not sure why you need to concat the data and labels - maybe it's to do with your network definition.
If you ever need to split the data and train on multiple GPUs, you can use the gluon.utils.split_and_load function to do that.

Adding statsmodels 'predict' results to a Pandas dataframe

It is common to want to append the results of predictions to the dataset used to make the predictions, but the statsmodels predict function returns (non-indexed) results of a potentially different length than the dataset on which predictions are based.
For example, if the test dataset, test, contains any null entries, then
mod_fit = sm.Logit.from_formula('Y ~ A + B + C', train).fit()
preds = mod_fit.predict(test)
will produce an array that is shorter than the length of test, and cannot be usefully appended with
test['preds'] = preds
And since the result of predict is not indexed, there is no way to recover the rows to which the results should be attached.
What is the idiom for associating predict results to the rows from which they were generated? Is there, perhaps, a way to get predict to return a dataframe that preserves the indices of its argument?
Predict shouldn't drop any rows. Can you post a minimal working example where this happens? Preserving the pandas index is on my radar and should be fixed in master soon.
https://github.com/statsmodels/statsmodels/issues/1501
Edit: Never mind. This is a known issue. https://github.com/statsmodels/statsmodels/issues/1352
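Until predict preserves the index itself, one possible workaround (a sketch, not taken from the linked issues) is to drop the incomplete rows explicitly, predict on the remainder, and let pandas align the results by index. The toy data, the seed and the column names A, B, C and Y are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Toy training data with a noisy binary outcome.
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["A", "B", "C"])
train["Y"] = (train["A"] + rng.normal(size=200) > 0).astype(int)

# Toy test data with one missing predictor value.
test = pd.DataFrame(rng.normal(size=(6, 3)), columns=["A", "B", "C"])
test.loc[2, "B"] = np.nan

mod_fit = smf.logit("Y ~ A + B + C", train).fit(disp=0)

# Predict only on complete rows, keeping their original index.
complete = test.dropna(subset=["A", "B", "C"])
preds = pd.Series(np.asarray(mod_fit.predict(complete)), index=complete.index)

# Assignment aligns on the index, so rows that were dropped get NaN.
test["preds"] = preds
```

Wrapping the result in a Series with an explicit index means the assignment does not depend on whether the installed statsmodels version returns an array or an indexed Series.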