How to traverse examples in tf.data.Dataset backward? - tensorflow

Usually, when using the tensorflow dataset api and assuming there is no shuffling, tensorflow will retrieve the first serialized example of the tfrecords and then proceed sequentially to retrieve the remaining examples. Is there a way to start from the last example instead and proceed backwards?
Any help is much appreciated!!
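The only workaround I can think of is to materialize everything in memory and reverse it there, along these lines (the file path is a placeholder, and this is only feasible when the serialized records fit in memory):

```python
import tensorflow as tf

# Read every serialized example into a Python list, reverse the list,
# and rebuild a dataset from it. Only feasible for datasets that fit
# in memory; "data.tfrecord" is a placeholder path.
records = list(tf.data.TFRecordDataset("data.tfrecord").as_numpy_iterator())
reversed_ds = tf.data.Dataset.from_tensor_slices(records[::-1])

for raw in reversed_ds.take(1):
    print(raw)  # the file's last serialized example now comes first
```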

Related

Using dynamically generated data with keras

I'm training a neural network using keras but I'm not sure how to feed the training data into the model in the way that I want.
My training data set is effectively infinite: I have some code to generate training examples as needed, so I just want to pipe a continuous stream of novel data into the network. keras seems to want me to specify my entire dataset in advance by creating a numpy array with everything in it, but this obviously won't work with my approach.
I've experimented with creating a generator class based on keras.utils.Sequence, which seems like a better fit, but it still requires me to specify a length via the __len__ method, which makes me think it will only create that many examples before recycling them. Can someone suggest a better approach?
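For concreteness, here is a sketch of the kind of unbounded pipeline I'm after, using tf.data.Dataset.from_generator (the generator body is a stand-in for my real example-generating code):

```python
import numpy as np
import tensorflow as tf

def infinite_generator():
    # Stand-in for the real example-generating code: yields
    # (features, label) pairs forever.
    while True:
        x = np.random.rand(10).astype(np.float32)
        y = np.float32(x.sum() > 5.0)
        yield x, y

ds = tf.data.Dataset.from_generator(
    infinite_generator,
    output_signature=(
        tf.TensorSpec(shape=(10,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.float32),
    ),
).batch(32)

# With an infinite dataset, steps_per_epoch bounds each "epoch":
# model.fit(ds, steps_per_epoch=1000, epochs=5)
```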

Filtering examples in Tensorflow Extended (TFX)

In the dataset I am working on, there are a lot of data points that I want to filter out, e.g. ones containing nan values, out-of-bound values, etc. I also want to apply the same filtering to the data points at inference time. Can I do this with TFX?
Currently I am filtering them before the TFX stages, similar to the examples here. The caveat of this approach is that the filtering can't be automatically replicated at inference time. I have implemented some TFX transformations, and I love that these transformations can be automatically replicated by calling the TFX transform graph layer, so I am wondering whether I can do the same thing to filter out the invalid data points. I think the blocker I faced was that TFX needs to know the expected tensor shape (because of the TF graph computation), and with filtering we wouldn't be able to know the expected output tensor shape.
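For reference, this is the kind of pre-TFX filtering I currently do, sketched with tf.data (the feature name and the bound are made up):

```python
import numpy as np
import tensorflow as tf

# Toy data: one dense float feature "x" per example, plus a label.
xs = np.array([[1.0, 2.0], [np.nan, 0.5], [1e9, 3.0]], dtype=np.float32)
ys = np.array([0.0, 1.0, 1.0], dtype=np.float32)
ds = tf.data.Dataset.from_tensor_slices(({"x": xs}, ys))

def is_valid(features, label):
    x = features["x"]
    finite = tf.reduce_all(tf.math.is_finite(x))  # drop nan/inf rows
    in_bounds = tf.reduce_all(tf.abs(x) <= 1e6)   # made-up bound
    return tf.logical_and(finite, in_bounds)

# Runs outside the Transform graph, hence not replicated at inference.
ds = ds.filter(is_valid)  # only the first example survives
```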
Thank you!

TensorFlow Federated - How to work with SparseTensors

I am using TensorFlow Federated to simulate a scenario in which clients hosted on a remote server can work with our very sparse dataset in a federated setting.
Presently, the code is capable of running with a small subset of the very sparse dataset being loaded on the server-side and passing it to the remote workers hosted on another device. The data is in SVM Light format and can be loaded through sklearn's load_svmlight_file function, but needs to be converted into Tensors to work within tff. The current solution to do so involves converting the very sparse data into a dense array, then setting it up through the tf.data.Dataset.from_tensor_slices function for use with a keras model (following existing examples for tff).
This works, but it takes up significant memory and is not suitable for the dataset: it cannot be run remotely with more than six samples due to the sparse data's serialized size, nor locally with more than a few hundred samples due to its size in memory.
To mitigate this, I converted the data into SparseTensors, but this approach fails because the tff.learning.from_keras_model function expects a pair of TensorSpec input_spec values, not a SparseTensorSpec input_spec with the labels being TensorSpec.
So, are there any concrete examples or known methods for working with SparseTensors within keras models in tff? Or must they be regular Tensors for now? The data loads fine when it is not converted to regular Tensors, so I will need to find a solution for working with the sparse data.
If there is presently no way to do so, are there examples of strategies within tff to work with very small subsets of data at a time, either being loaded directly with the remote client or being passed from the server?
Thanks!
I'd say the best approach right now is to work with TF's representation of a tf.SparseTensor: that is, a tuple of three tensors, indices, values and dense_shape.
Since the problem is that Keras requires the input not to be sparse tensors, you can instead pass the input as, for instance, a dictionary consisting of these three tensors, which you convert back into a tf.sparse.SparseTensor as part of your tf.data pipeline.
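A minimal sketch of what I mean, with hypothetical feature names and a fixed dense_shape:

```python
import tensorflow as tf

num_features = 1000  # hypothetical fixed dimensionality

# One example, represented as a dict of three dense tensors.
example = {
    "indices": tf.constant([[0], [7], [42]], dtype=tf.int64),
    "values": tf.constant([1.0, 3.5, -2.0], dtype=tf.float32),
    "dense_shape": tf.constant([num_features], dtype=tf.int64),
}
label = tf.constant(1.0)

ds = tf.data.Dataset.from_tensors((example, label))

def to_sparse(features, label):
    # Rebuild the SparseTensor inside the tf.data pipeline.
    sp = tf.sparse.SparseTensor(
        indices=features["indices"],
        values=features["values"],
        dense_shape=features["dense_shape"],
    )
    return sp, label

ds = ds.map(to_sparse)
```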
See also this tutorial, which I think is doing something related to what you are looking for, and please ask more detailed questions if needed!

Return both instance id and prediction from predict() method of a Keras model

Suppose I have a Keras model which is already trained. When using the predict() method, I want to get the instance key value and the corresponding prediction at the same time (I can pass the key value as a feature/column in the input).
I wonder, is it realistic to do that?
I struggled with this for a while. I'm using the tf.data.Dataset infrastructure, so my first approach was to see if I could ensure that the order of the examples produced by the dataset was deterministic. That wasn't optimal, because it gave up a bunch of the parallel-processing performance benefits and ended up not being the case in any event. I ended up processing predictions with model.predict_on_batch, feeding in batches iterated out of the dataset manually instead of feeding the entire dataset into model.predict. That way I was able to grab the ids from each batch and associate them with the returned predictions.
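Roughly like this, with toy stand-ins for the model and the data:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for an already-trained model and an id-carrying dataset.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
ids = np.arange(10, dtype=np.int64)                  # instance keys
features = np.random.rand(10, 4).astype(np.float32)
ds = tf.data.Dataset.from_tensor_slices((ids, features)).batch(3)

results = []
for batch_ids, batch_x in ds:                        # iterate batches manually
    preds = model.predict_on_batch(batch_x)
    results.extend(zip(batch_ids.numpy().tolist(), preds[:, 0].tolist()))

print(results[:3])  # [(0, ...), (1, ...), (2, ...)]
```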
I was surprised there wasn't a more ready-made solution to a problem that must come up a lot. I haven't gotten up to speed on the Estimator interface or custom training/prediction loops yet, but hopefully this problem becomes trivial there.

What is the need to do sharding of TFRecords files?

Why is the TFRecords file sharded in the Inception model example in TensorFlow?
For randomness, couldn't the list of files simply be shuffled before creating one single TFRecord file?
Why is the TFRecords file sharded in the Inception model example in TensorFlow?
According to the object detection API, there are two advantages to sharding your dataset:
Files can be read in parallel, improving data loading speed
Examples can be shuffled better by sharding
You probably already knew the second point, as it appears in your second question:
For randomness, couldn't the list of files simply be shuffled before creating one single TFRecord file?
Shuffling the dataset before creating the record is indeed good practice, because a TFRecord can only be shuffled partially: you can only load a certain number of examples into memory, and the shuffling is then done by randomly selecting the next example from among those in memory. You can see more in this question.
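For illustration (placeholder path):

```python
import tensorflow as tf

# Only buffer_size examples are held in memory; each output example is
# drawn at random from that buffer, so the shuffle is only partial.
ds = tf.data.TFRecordDataset("data.tfrecord")  # placeholder path
ds = ds.shuffle(buffer_size=10_000)
```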
However, if you only shuffle the dataset when creating the record, your network will always see the examples in the same order across successive training epochs. This might result in unwanted convergence behaviour due to the one random order that was fixed once and for all. It is thus more interesting to shuffle the dataset on the fly, so that different epochs see different orderings.
Sharding your dataset makes this on-the-fly shuffling easier. Instead of being forced to always read in the same order from one single file, you can start reading a bit from each file, choosing the next file at random.
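For instance, something along these lines (the shard pattern is a placeholder):

```python
import tensorflow as tf

# Shuffle the list of shard files, then interleave reads across several
# shards at once; a buffer-based shuffle on top mixes examples further.
files = tf.data.Dataset.list_files("train-*.tfrecord", shuffle=True)
ds = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,                       # read from 4 shards at a time
    num_parallel_calls=tf.data.AUTOTUNE,  # parallel file reads
)
ds = ds.shuffle(buffer_size=10_000)
```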