When working with batches via the Dataset API in TensorFlow, what is the recommended way to perform index lookups in a dictionary? - tensorflow

I am currently refactoring existing code over to the newer TF Dataset API. In our current process we populate a standard Python dictionary mapping product ids to classification ids.
I have now moved our images/paths over to a TF Dataset, and using tf.string_split I extract various pieces of information from the filename itself, one of them being the product_id. At this point the product_id is a TF tensor, so I can no longer do the lookup the way we did before with "if product_id in products_to_class": I now have a tensor, and I can't search a standard Python dictionary with it.
I am also using this project as a way to learn how to increase performance, so I wanted to know what the best/recommended approach is when working with TF Dataset API batches. Do I convert the product_id to a string and perform the lookup via the if check above, or do I convert the products_to_class dictionary to another data structure (such as another Dataset) and perform the lookup with tensors throughout? Any advice would be greatly appreciated.
Small example of what I have currently is:
products_to_class = {'12345': 0, '67890': 1}
# Below logic is in a mapped function used on a tf.data.Dataset
def _parse_fn(filename, label):
    core_file = tf.string_split([filename], '\\').values[-1]
    product_id = tf.string_split([core_file], ".").values[0]
    # Unable to do the check below because product_id is now a tensor and
    # products_to_class is a Python dictionary
    if product_id in products_to_class:
        label = products_to_class[product_id]

The built-in TensorFlow mechanism for doing this is to use a tf.contrib.lookup table. For example, if you have a list of string keys that you want to map to dense integers, you can define the following outside your _parse_fn():
# This constructor creates a lookup table that implicitly maps each string in the
# argument to its index in the list (e.g. '67890' -> 1).
products_to_class = tf.contrib.lookup.index_table_from_tensor(['12345', '67890'])
...and then use products_to_class.lookup() in your _parse_fn().
def _parse_fn(filename, label):
    core_file = tf.string_split([filename], '\\').values[-1]
    product_id = tf.string_split([core_file], ".").values[0]
    # Returns a `tf.Tensor` that corresponds to the value associated with
    # `product_id` in the `products_to_class` table.
    label = products_to_class.lookup(product_id)
    # ...
Note that this places two additional constraints on your program:
You must use Dataset.make_initializable_iterator() instead of Dataset.make_one_shot_iterator().
You must call sess.run(tf.tables_initializer()) before starting to consume elements from the input pipeline.
Both of these will be handled for you if you use the high-level tf.estimator API and return the tf.data.Dataset from your input_fn.
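To tie the pieces together, here is a hedged end-to-end sketch (TF 1.x graph mode, tf.contrib era); the filenames, batch size, and Session-based loop are made up for illustration:
import tensorflow as tf

products_to_class = tf.contrib.lookup.index_table_from_tensor(['12345', '67890'])

def _parse_fn(filename):
    core_file = tf.string_split([filename], '\\').values[-1]
    product_id = tf.string_split([core_file], ".").values[0]
    label = products_to_class.lookup(product_id)
    return filename, label

dataset = tf.data.Dataset.from_tensor_slices(['dir\\12345.jpg', 'dir\\67890.jpg'])
dataset = dataset.map(_parse_fn).batch(2)

iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(tf.tables_initializer())  # required because of the lookup table
    sess.run(iterator.initializer)
    print(sess.run(next_element))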

Related

Keras input processing with a DataFrame variable-length list of strings

I am trying to build a TF/Keras model that takes in a sequential feature and scalar features. The training data comes from a Pandas DataFrame. The sequential feature for one example can be considered a list of strings (words, of varying length) under one column of the DataFrame. The words themselves can be seen as categorical, with a limited number of unique words. I am wondering what the right order and method is to process data of this kind. Possible steps include mapping the strings to integers and padding/truncating to a fixed length.
I was planning to convert the sequential features and scalar features into tensors following https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers, then feed the sequential features into an LSTM and the scalar features into an MLP, and use a FCN to combine their outputs. I am stuck at the data processing step.
I have tried using keras.layers.StringLookup to convert the string list feature into an integer list, but it complains that the nparray cannot be converted to a tensor. Should I first convert the list of strings into a string tensor and then convert it into an integer tensor? And what is the right order and method to process data of this kind?
Yes, as a first step you can convert your list of strings to tensors. To convert a list of strings to a tensor, you can use the tf.constant function. For example:
import tensorflow as tf
s = ["dog", "cat"]
ts = tf.constant(s)
print(ts)
You get:
tf.Tensor([b'dog' b'cat'], shape=(2,), dtype=string)
Then you can use StringLookup and CategoryEncoding like in function get_category_encoding_layer() on
https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers#categorical_columns
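For the variable-length case, a hedged sketch of that StringLookup -> CategoryEncoding flow might look like the following; the vocabulary and the word lists are made up, and shorter lists are padded with "" so they fit into one dense string tensor:
import tensorflow as tf

words = tf.constant([["dog", "cat", ""],
                     ["cat", "", ""],
                     ["bird", "dog", "cat"]])

lookup = tf.keras.layers.StringLookup(vocabulary=["dog", "cat", "bird"], mask_token="")
encoder = tf.keras.layers.CategoryEncoding(
    num_tokens=lookup.vocabulary_size(), output_mode="multi_hot")

ids = lookup(words)      # strings -> integer ids; 0 is the mask/padding id
encoded = encoder(ids)   # one multi-hot vector per row (column 0 marks padding)
print(encoded)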

Applying Tensorflow Dataset .map() to subsequent dataset elements

I've got a TFRecordDataset and I'm trying to preprocess the features of two subsequent elements by means of the map() API.
dataset_ext = dataset.map(lambda x: tf.py_function(parse_data, [x], [tf.float32]))
As map applies the function parse_data to every dataset element, I don't know what parse_data should look like in order to keep track of the feature extracted from the previous dataset element.
Can anyone help? Thank you
EDIT: I'm working on the Waymo dataset, so each element is a frame. You can refer to https://github.com/Jossome/Waymo-open-dataset-document for its structure.
This is my parse function parse_data:
from waymo_open_dataset import dataset_pb2 as open_dataset

def parse_data(input_data):
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(input_data.numpy()))
    av_speed = (frame.images[0].velocity.v_x,
                frame.images[0].velocity.v_y,
                frame.images[0].velocity.v_z)
    return av_speed
I'd like to build a dataset whose features are the car speed and acceleration, defined as the speed variation between subsequent frames (the first value can be 0).
One way I thought about is to give the map function dataset and dataset.skip(1) as inputs but I'm not sure about it yet.
I am not sure, but it might be unnecessary to make your mapped function a tf.py_function. What parse_data should look like depends on your dataset dataset_ext. If it contains, for example, two file paths (one instance of input data and one instance of output data), the mapping function should take two arguments and return two values.
For example, if your dataset contains images and you want them to be randomly cropped each time an example of your dataset is drawn, the mapping function looks like this:
def process_img_random_crop(img_in, img_out, output_shape):
    merged = tf.stack([img_in, img_out])
    mergedCrop = tf.image.random_crop(merged, size=(2,) + output_shape)
    img_in_cropped, img_out_cropped = tf.unstack(mergedCrop, 2, 0)
    return img_in_cropped, img_out_cropped
I call it as follows:
image_ds_test = image_ds_test.map(lambda i, o: process_img_random_crop(i, o, output_shape=(64, 64, 1)), num_parallel_calls=tf.data.experimental.AUTOTUNE)
What exactly is your plan with dataset_ext and what does it contain?
Edit:
Okay, got what you meant with the two frames. The map function is applied to each entry of your dataset separately, so if you need cross-entry information, a single entry of your dataset needs to contain two frames. With this more complicated set-up, I would suggest using a tensorflow Sequence: the explanation from the tensorflow team is pretty straightforward. Hope this helps!
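Alternatively, if you want to stay with tf.data, here is a hedged sketch of the dataset / dataset.skip(1) idea mentioned in the question; it assumes speed_ds is a dataset where each element is a single per-frame speed tensor (for example obtained with a parse_data map like the one above):
import tensorflow as tf

pairs = tf.data.Dataset.zip((speed_ds, speed_ds.skip(1)))

# acceleration approximated as the speed difference between subsequent frames;
# note this drops the first frame rather than giving it a 0 value
features = pairs.map(lambda prev_speed, speed: (speed, speed - prev_speed))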

In tensorflow, how can I index a feature's value given another feature as a key?

I have a feature preprocessing problem that is just a bit too complex for me to solve.
I want to generate a “cross-feature” out of 3 others, let me detail:
My ML problem is to recommend items to users.
In my example there are features about the users and features about the item. I’m trying to predict if the user will like this item or not.
We use tensorflow examples.
One of my user features is a “map” of item ids to the user’s “affinity” for them. Let’s call it the “map of item affinities”.
The affinity itself is computed by another process.
Since there is no map type in TensorFlow Examples, we have two features: one is the ordered list of item ids, the other is the ordered list of affinities. They are synchronous, so my “map of item affinities” is actually represented by two features, item_affinities_ids and item_affinities.
Yes, I’m using item affinity information as an input and trying to predict another item affinity. But those are different: the input is computed for a different product use-case than the one I am trying to predict.
I also have a 3rd feature, which is the item_id of the item I’m trying to compute a new affinity for.
In naive numpy I could do it like this:
item_name = np.array(["item-a"])
item_affinities_ids = np.array(["item-0", "item-a", "item-b"])
item_affinities = np.array([0.2, 0.3, 0.4])
indices = np.where(item_affinities_ids == item_name)
return item_affinities[indices]
Now, where things can get more complicated is in real life:
I want a tensorflow implementation (TFT or native TF).
We use TF v.13
The “map of item affinities” can be missing, so the two resulting features item_affinities_ids and item_affinities are represented as SparseTensors. However, if one is there, the other is too, and they are guaranteed to be synchronous (same size, same order).
We do prediction and training on batches of examples, so the first dimension of each of my (Sparse)Tensors is the batch_size > 1.
The item_id may not exist in the “map of item affinities”. In that case I want a default value (0.0).
I’m looking for a tensorflow implementation that would deal with all of these requirements.
So far I have:
# using constants for the demonstration. In real life it would be tensors.
item_name = tf.constant([["item-a"], ["item-3"]])
item_affinities_ids = tf.constant([["item-0", "item-a", "item-b"], ["item-2", "item-1", "item-3"]])
item_affinities = tf.constant([[0.2, 0.3, 0.4], [0.2, -0.9, 0.4]])
return tf.boolean_mask(item_affinities, tf.equal(item_affinities_ids, item_name))
But it does not handle SparseTensor and the case when the item_id is not in the item_affinities list.
I’m looking for anyone to help me with that.
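For the dense-tensor part of the problem, one hedged sketch (it does not handle SparseTensors) is to broadcast an equality mask and sum along the item axis, which also yields the 0.0 default when the item_id has no match:
import tensorflow as tf

# using constants for the demonstration, as in the snippet above;
# the third row has no matching id, so it falls back to the 0.0 default
item_name = tf.constant([["item-a"], ["item-3"], ["item-x"]])
item_affinities_ids = tf.constant([["item-0", "item-a", "item-b"],
                                   ["item-2", "item-1", "item-3"],
                                   ["item-2", "item-1", "item-3"]])
item_affinities = tf.constant([[0.2, 0.3, 0.4], [0.2, -0.9, 0.4], [0.2, -0.9, 0.4]])

match = tf.cast(tf.equal(item_affinities_ids, item_name), item_affinities.dtype)
# rows with no matching id sum to 0.0, which doubles as the default value
result = tf.reduce_sum(item_affinities * match, axis=1)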

Creating a new random operation in computation graph of Tensorflow

How can I create a new random operator (something like tf.random_normal) which is part of the graph? I want to add a Cauchy random variable to the output of one layer of my network.
I found tf.contrib.distributions.Cauchy, but how can I make it work inside a layer (as we do with random_normal)?
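One hedged way to do this in the TF 1.x / tf.contrib setting (assuming tf.contrib.distributions is available) is to sample a noise tensor with the same shape as the layer output and add it, analogous to adding tf.random_normal noise; add_cauchy_noise below is a hypothetical helper:
import tensorflow as tf

cauchy = tf.contrib.distributions.Cauchy(loc=0.0, scale=1.0)

def add_cauchy_noise(layer_output):
    # sample a Cauchy noise tensor with the same (dynamic) shape as the layer output
    noise = cauchy.sample(tf.shape(layer_output))
    return layer_output + noise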

sklearn: get feature names after L1-based feature selection

This question and answer demonstrate that when feature selection is performed using one of scikit-learn's dedicated feature selection routines, then the names of the selected features can be retrieved as follows:
np.asarray(vectorizer.get_feature_names())[featureSelector.get_support()]
For example, in the above code, featureSelector might be an instance of sklearn.feature_selection.SelectKBest or sklearn.feature_selection.SelectPercentile, since these classes implement the get_support method which returns a boolean mask or integer indices of the selected features.
When one performs feature selection via linear models penalized with the L1 norm, it's unclear how to accomplish this. sklearn.svm.LinearSVC has no get_support method and the documentation doesn't make clear how to retrieve the feature indices after using its transform method to eliminate features from a collection of samples. Am I missing something here?
For sparse estimators you can generally find the support by checking where the non-zero entries are in the coefficients vector (provided the coefficients vector exists, which is the case for e.g. linear models)
support = np.flatnonzero(estimator.coef_)
For your LinearSVC with l1 penalty it would accordingly be
from sklearn.svm import LinearSVC
svc = LinearSVC(C=1., penalty='l1', dual=False)
svc.fit(X, y)
selected_feature_names = np.asarray(vectorizer.get_feature_names())[np.flatnonzero(svc.coef_)]
I've been using sklearn 15.2, and according to the LinearSVC documentation, coef_ is an array of shape [n_features] if n_classes == 2 else [n_classes, n_features].
So first, np.flatnonzero doesn't work for multi-class; you'll get an index out of range error. Second, it should be np.where(svc.coef_ != 0)[1] instead of np.where(svc.coef_ != 0)[0], since index 0 refers to classes, not features. I ended up using np.asarray(vectorizer.get_feature_names())[list(set(np.where(svc.coef_ != 0)[1]))]
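Putting that together, here is a hedged, self-contained sketch of the multi-class variant; the toy corpus and labels are made up, and get_feature_names is used as in this thread (newer scikit-learn versions rename it get_feature_names_out):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# made-up toy corpus, purely for illustration
docs = ["cheap pills buy now", "meeting at noon", "buy cheap meds", "lunch at noon today"]
y = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

svc = LinearSVC(C=1., penalty='l1', dual=False)
svc.fit(X, y)

# coef_ has shape (n_classes, n_features) for multi-class (and (1, n_features)
# for binary), so axis 1 indexes features; take the union across classes
feature_indices = sorted(set(np.where(svc.coef_ != 0)[1]))
selected_feature_names = np.asarray(vectorizer.get_feature_names())[feature_indices]
print(selected_feature_names)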