Keras input process with DataFrame variable length list of strings - tensorflow

I am trying to build a TF/Keras model that takes in a sequential feature and scalar features. The training data comes from a Pandas DataFrame. The sequential feature for one example can be considered a list of strings (words) of varying length under one column of the DataFrame. The words themselves can be seen as categorical, with a limited number of unique words. I am wondering what the right order and method is to process data of this kind. Possible steps include mapping the strings to integers and padding/truncating to a fixed length.
I was planning to convert the sequential features and scalar features into tensors following https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers, then feed the sequential features into an LSTM and the scalar features into an MLP, and use an FCN to combine their outputs. I am stuck at the data processing step.
I have tried using keras.layers.StringLookup to convert the string-list feature into an integer list, but it complains that the numpy array cannot be converted to a tensor. Should I first convert the list of strings into a string tensor and then convert that into an integer tensor? And what is the right order and method to process data of this kind?

Yes, as a first step you can convert your list of strings to a tensor. To do so, you can use the tf.constant function. For example:
import tensorflow as tf
s = ["dog", "cat"]
ts = tf.constant(s)
print(ts)
You get:
tf.Tensor([b'dog' b'cat'], shape=(2,), dtype=string)
Then you can use StringLookup and CategoryEncoding, as in the function get_category_encoding_layer() at
https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers#categorical_columns
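For the variable-length string lists in your DataFrame column, here is a minimal sketch (assuming TF 2.x and a small, made-up vocabulary) that maps the strings to integers with StringLookup and pads them, using a ragged tensor so each example can have a different number of words:
import tensorflow as tf

vocab = ["dog", "cat", "bird"]  # assumed vocabulary
lookup = tf.keras.layers.StringLookup(vocabulary=vocab)

# Each example is a list of strings of different length, so use a ragged tensor.
sequences = tf.ragged.constant([["dog", "cat"], ["bird"]])
ids = lookup(sequences)                   # ragged tensor of integer ids
padded = ids.to_tensor(default_value=0)   # pad to a dense tensor, e.g. for an LSTM
print(padded)
Note that with the default settings index 0 is the out-of-vocabulary index, so you may want to configure a mask token if you also pad with 0.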

Related

When working with batches via the Dataset API in TensorFlow, what is the recommended way to perform index lookups in a dictionary?

I am currently refactoring existing code over to the newer TF Dataset API. In our current process we populate a standard python dictionary mapping product ids to classification ids.
Now I have moved our images/paths over to a TF Dataset, and using tf.string_split I extract various pieces of information from the filename itself, one of them being the product_id. At this point the product_id is a TF tensor, so I can no longer perform the lookup the way we did before via "if product_id in products_to_class", because I can't search a standard dictionary with a tensor.
I am using this project as a way to learn how to increase performance, so I want to know what the "best/recommended" approach is when working with tf Dataset API batches. Do I convert the product_id to a string and perform the lookup via the if check above, or do I convert the products_to_class dictionary to another data structure, such as another Dataset, and perform the lookup using tensors throughout? Any advice would be greatly appreciated.
Small example of what I have currently is:
products_to_class = {'12345': 0, '67890': 1}

# Below logic is in a mapped function used on a tf.data.Dataset
def _parse_fn(filename, label):
    core_file = tf.string_split([filename], '\\').values[-1]
    product_id = tf.string_split([core_file], ".").values[0]
    # Unable to perform the check below because product_id is now a tensor
    # and products_to_class is a python dictionary
    if product_id in products_to_class:
        label = products_to_class[product_id]
The built-in TensorFlow mechanism for doing this is to use a tf.contrib.lookup table. For example, if you have a list of string keys that you want to map to dense integers, you can define the following outside your _parse_fn():
# This constructor creates a lookup table that implicitly maps each string in the
# argument to its index in the list (e.g. '67890' -> 1).
products_to_class = tf.contrib.lookup.index_table_from_tensor(['12345', '67890'])
...and then use products_to_class.lookup() in your _parse_fn().
def _parse_fn(filename, label):
    core_file = tf.string_split([filename], '\\').values[-1]
    product_id = tf.string_split([core_file], ".").values[0]
    # Returns a `tf.Tensor` that corresponds to the value associated with
    # `product_id` in the `products_to_class` table.
    label = products_to_class.lookup(product_id)
    # ...
Note that this places two additional constraints on your program:
You must use Dataset.make_initializable_iterator() instead of Dataset.make_one_shot_iterator().
You must call sess.run(tf.tables_initializer()) before starting to consume elements from the input pipeline.
Both of these will be handled for you if you use the high-level tf.estimator API and return the tf.data.Dataset from your input_fn.
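As a rough sketch of those two steps in TF 1.x style, assuming dataset is the tf.data.Dataset from above with _parse_fn already mapped over it (and _parse_fn returning its features and label):
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(tf.tables_initializer())   # initializes the lookup table
    sess.run(iterator.initializer)      # initializes the dataset iterator
    features, label = sess.run(next_element)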

A similar approach for LabelEncoder in sklearn.preprocessing?

For encoding categorical data like sex we normally use LabelEncoder() in scikit-learn. But if I'm going to use TensorFlow instead of scikit-learn, what is the equivalent function or methodology for this task? I know that we can do one-hot encoding easily with TensorFlow, but then it will create labels as 10, 01 instead of 1, 0.
There is a module in TensorFlow called tf.feature_column that contains four methods to create categorical columns from your input data:
categorical_column_with_hash_bucket(...): Hash the input value to a fixed number of categories
categorical_column_with_identity(...): If you have numeric input and you want the value itself to be treated as a categorical column
categorical_column_with_vocabulary_list(...): Outputs a category based on a fixed, in-memory list of words
categorical_column_with_vocabulary_file(...): Same as _list but reads the vocabulary from file
The module also provides many more ways of getting your input data to the model. For an overview, see this blog post written by the developers of the package.
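For the sex example from the question, a minimal sketch with a vocabulary list might look like this (the feature name and vocabulary are assumed); the categorical column assigns each word an integer id based on its position in the list, much like LabelEncoder:
import tensorflow as tf

# Assumed feature name and vocabulary: 'male' -> 0, 'female' -> 1.
sex_column = tf.feature_column.categorical_column_with_vocabulary_list(
    'sex', vocabulary_list=['male', 'female'])

# To feed a model, wrap it as an indicator (one-hot) or embedding column.
sex_indicator = tf.feature_column.indicator_column(sex_column)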

tf.nn.embedding_lookup - row or column?

This is a very simple question. I'm learning TensorFlow and converting my numpy code to TensorFlow.
I have a word embedding matrix defined as U = [embedding_size, vocab_size], so each column is the embedding vector of a word.
I converted U into TF like below:
U = tf.Variable(tf.truncated_normal([embedding_size, vocab_size], -0.1, 0.1))
So far, so good.
Now I need to look up each word's embedding for training. I assume it would be
tf.nn.embedding_lookup(U, word_index)
My question: because each embedding is a column vector, in numpy I would look it up as U[:, x[t]].
How does TF figure out it needs to return the row OR column by word_index?
What's the default? Row or column?
If it's a row vector, then do I need to transpose my embedding matrix?
https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup
doesn't mention this. If anyone could point me to the right resource, I'd appreciate it.
If params is a single tensor, the tf.nn.embedding_lookup(params, ids) operation treats ids as the indices of rows in params. If params is a list of tensors or a partitioned variable, then ids still correspond to rows in those tensors, but the partition_strategy (either "div" or "mod") determines how the ids map to a particular row.
As Aaron suggests, it will probably be easiest to define your embedding U as having shape [vocab_size, embedding_size], so that you can use tf.nn.embedding_lookup() and related functions.
Alternatively, you can use the axis argument to tf.gather() to select columns from U:
embedding = tf.gather(U, word_index, axis=1)
U should be vocab_size x embedding_size, the transpose of what you have now.
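A small sketch contrasting the two layouts (toy sizes assumed):
import tensorflow as tf

vocab_size, embedding_size = 5, 3
word_index = tf.constant([0, 2])

# Row layout: embedding_lookup selects rows, matching [vocab_size, embedding_size].
U_rows = tf.Variable(tf.random.truncated_normal([vocab_size, embedding_size]))
rows = tf.nn.embedding_lookup(U_rows, word_index)   # shape [2, embedding_size]

# Column layout as in the question: select columns explicitly with tf.gather.
U_cols = tf.Variable(tf.random.truncated_normal([embedding_size, vocab_size]))
cols = tf.gather(U_cols, word_index, axis=1)        # shape [embedding_size, 2]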

How to output the shape in CNTK?

I wrote this code:
matrix = C.softmax(model).eval(data)
But matrix.shape and matrix.size give me errors. So I'm wondering, how can I output the shape of a CNTK variable?
First note that eval() will not give you a CNTK variable; it will give you a numpy array (or a list of numpy arrays, see the next point).
Second, depending on the nature of the model, it is possible that what comes out of eval() is not a numpy array but a list. The reason is that if the output is a sequence, then CNTK cannot guarantee that all sequences will be of the same length, so it returns a list of arrays, each array being one sequence.
Finally, if you truly have a CNTK variable, you can get the dimensions with .shape
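A small sketch of that check (model and data are assumed to be defined as in the question):
import cntk as C

out = C.softmax(model).eval(data)

if isinstance(out, list):
    # Sequence output: one numpy array per sequence, lengths may differ.
    print([seq.shape for seq in out])
else:
    # A single numpy array.
    print(out.shape)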

Variable length dimension in tensor

I'm trying to implement the paper "End-to-End memory networks" (http://arxiv.org/abs/1503.08895)
Each training example consists of a number of sentences, a question, and then the answer. The number of sentences is variable, as is the number of words in each sentence and in the question. Each word is encoded as an integer, so my input would have the form [batch size, # of sentences, # words in sentence].
Now my problem is that the second and third dimensions are unknown for each mini-batch. Can I still somehow represent this input as a single tensor, or do I have to use lists of tensors, so that I have a list of length batch_size, then a sublist whose length is the number of sentences, and then for each sentence a tensor, whose size is also not known in advance, corresponding to the words encoded as integers?
Can I use this second approach, or will tensorflow then not be able to backpropagate? For example, I have an operation where I have to calculate the sum \sum_i tf.scalar_mul(p_i, c_i), where p_i is a scalar and c_i is an embedding vector that was previously calculated. The tensors for the p and c values are stored in lists, so I would have to sum over the elements of the two lists in a loop. I'm assuming that tensorflow would not be able to incorporate this loop in the computation graph, correct? I'm sceptical since theano has a special scan function that allows one to loop over input, so I'm assuming that a regular loop would cause problems in the computation graph. How does tensorflow handle this?
Moving Yaroslav's comment to an answer:
TensorFlow has tf.scan. Dimensions may also be dynamic as in Theano.
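For the weighted sum in the question, a minimal sketch with tf.scan (TF 1.x style; shapes are assumed, with p of shape [n] and c of shape [n, embedding_dim], n dynamic):
import tensorflow as tf

embedding_dim = 20  # assumed
p = tf.placeholder(tf.float32, shape=[None])                  # scalars p_i
c = tf.placeholder(tf.float32, shape=[None, embedding_dim])   # embeddings c_i

# Accumulate p_i * c_i over the (dynamic) first dimension; take the final value.
partial_sums = tf.scan(lambda acc, x: acc + x[0] * x[1], (p, c),
                       initializer=tf.zeros([embedding_dim]))
weighted_sum = partial_sums[-1]
The same sum can also be written without an explicit loop as tf.reduce_sum(tf.expand_dims(p, 1) * c, axis=0), which stays inside the computation graph and remains differentiable.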