TensorFlow placeholder for InputList - tensorflow

Some raw operations use InputLists, not (only) simple Inputs. I want to add a Placeholder to my Graph and supply the actual array of tensors during TF_SessionRun. I have two problems with this:
TF_SessionRun does not talk about InputLists; it only knows Inputs. I assume (correct me if I am wrong) that from a TF_Session point of view, an InputList is just an Input (giving the first element of the array).
I cannot work out how to put such a Placeholder in the Graph. Defining a Placeholder requires specifying its data type, but in an InputList every tensor can have its own data type.
I am looking either for a data type like "DT_List" indicating that the given Placeholder is a list of different tensors, OR for another raw operation, called "ListPlaceholder" or similar, that caters for this purpose.
How should this be done?
P.S. Imagine the raw operation Save. Its third parameter is an InputList of tensors to save. I made a Graph that works well for a single tensor, but I cannot work out how to do it for multiple tensors in one go.

After a lot of checking, it seems I incorrectly guessed that there is (or should be) such a thing as an InputList input. The inputs to Session.Run are always single tensors, so no "placeholder for a list" exists. In the mentioned "Save" raw operation, the "data" parameter does — as guessed — have to be added using TF_AddInputList, but the list of TF_Outputs in its parameter list has to be assembled from individual TF_Output elements; it cannot be retrieved as one list from a "Placeholder"-like node.
If my conclusion is wrong, please correct me.
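For reference, the call sequence that follows from this conclusion looks roughly like the sketch below (pseudocode; MakePlaceholder is a hypothetical helper, and error handling is omitted). Each tensor to save gets its own Placeholder with its own dtype, and the resulting TF_Output values are packed into a plain array for TF_AddInputList:

```
// one Placeholder per tensor, each with its own dtype
TF_Output t0 = {MakePlaceholder(graph, "t0", TF_FLOAT), 0};
TF_Output t1 = {MakePlaceholder(graph, "t1", TF_INT32), 0};

TF_OperationDescription* desc = TF_NewOperation(graph, "Save", "save_op");
TF_AddInput(desc, filename_output);       // filename: a single Input
TF_AddInput(desc, tensor_names_output);   // tensor_names: a single Input
TF_Output data[2] = {t0, t1};
TF_AddInputList(desc, data, 2);           // data: an InputList, assembled by hand
TF_Operation* save_op = TF_FinishOperation(desc, status);
```

At TF_SessionRun time, each of the individual placeholders is then fed like any other single Input.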

Related

Use of embeddings to preserve order invariance

I want to recommend an item complementary to a cart of items. So, naturally, I thought of using embeddings to represent items, and I came up with a layer of this kind in Keras:
item_input = Input(shape=(MAX_CART_SIZE,), name="item_id")
item_embedding = Embedding(input_dim=NB_ITEMS+1, input_length=MAX_CART_SIZE, output_dim=EMBEDDING_SIZE, mask_zero=True)
I used masking to handle the variable size of the carts. So, the output tensor of this layer has dimensions MAX_CART_SIZE x EMBEDDING_SIZE. It means that there are as many different embeddings as there are potential items. In other words, an item can be encoded in a different way according to its position within the cart, and that's undesirable behavior... Yet it seems that most neural networks dealing with NLP data work this way, with embeddings associated not with words alone but with words at given indices within a phrase.
So, what would be the correct way to preserve order invariance? In other words, I'd like the cart A,B,C to be strictly equivalent to the carts C,B,A or B,A,C in terms of input representation and generated output.
One way to obtain invariance is to use a Transformer architecture WITHOUT positional embeddings. That way, each item is encoded to an embedding, and because there is no positional embedding, the item's embedding is the same whether it is in the first position or the last one.
Moreover, the Transformer architecture itself is invariant to such positions as long as you avoid the positional embedding.
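The invariance property itself is easy to check numerically. Here is a minimal NumPy sketch (names and sizes are mine) using sum pooling over a shared embedding table as the aggregation — which, like attention without positional embeddings, treats the cart as a set:

```python
import numpy as np

rng = np.random.default_rng(0)
NB_ITEMS, EMBEDDING_SIZE = 10, 4
# one shared embedding row per item; row 0 reserved for padding
embedding_table = rng.normal(size=(NB_ITEMS + 1, EMBEDDING_SIZE))

def cart_representation(item_ids):
    """Embed each item with the shared table, then sum-pool.
    Summation is commutative, so the result is order-invariant."""
    return embedding_table[np.asarray(item_ids)].sum(axis=0)

cart_abc = cart_representation([1, 2, 3])
cart_cba = cart_representation([3, 2, 1])
assert np.allclose(cart_abc, cart_cba)  # same representation, any order
```

Any commutative pooling (sum, mean, max) over position-independent embeddings gives the same guarantee; the Transformer adds expressiveness on top of it.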

Confusion about how bucketized feature columns work

I had some confusion about how bucketized feature columns represent input to the model. According to the blog post on feature columns, bucketizing a feature like year puts each value in a bucket based on the defined boundaries and creates a binary vector, turning on the bucket that corresponds to the input value. The example in the documentation, however, shows the output as a single integer. I'm confused as to what the input to the model is when using a bucketized column. Can anyone clarify this for me, please?
From the dimensions of the first hidden layer of the estimator, it seems that for each feature column that is a tf.feature_column.bucketized_column, a one-hot encoded vector is created based on the boundaries.
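Both views are consistent: the bucket index is a single integer, and what the model's first layer sees is the one-hot encoding of that index. A NumPy sketch of what bucketizing with boundaries does (the boundary values here are made up):

```python
import numpy as np

boundaries = [1960, 1980, 2000]           # 3 boundaries -> 4 buckets
years = np.array([1955, 1975, 1999, 2013])

# Step 1: map each value to a bucket index
# (this is the "single integer" shown in the documentation)
bucket_ids = np.digitize(years, boundaries)   # -> [0, 1, 2, 3]

# Step 2: one-hot encode the indices
# (this is the binary vector the model actually receives)
one_hot = np.eye(len(boundaries) + 1)[bucket_ids]
```

So the integer and the binary vector are two representations of the same bucket assignment.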

Tensorflow word2vec InvalidArgumentError: Assign requires shapes of both tensors to match

I am using this code to train a word2vec model. I am trying to train it incrementally, using saver.restore(). I am using new data after restoring the model. Since the vocabulary sizes for the old data and the new data are not the same, I get an exception like this:
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [28908,200] rhs shape= [71291,200]
Here 71291 is vocabulary size for the old data and 28908 is for new data.
It gets the vocabulary words from the train_data file here, and constructs the network model using the size of the vocabulary. I thought that if I could make the vocabulary size the same for my old data and new data, I could solve this problem.
So, my question is: can I do that in this code? As far as I understand, I cannot access the skipgram_word2vec() function.
Or, is there any other way of solving this issue in this code besides the one I thought of? If it is not possible using this code, I will try other ways to achieve my purpose.
Any help is appreciated.
Having taken a look at the source of word2vec_optimized.py, I'd say you will need to change the code there. It operates by opening a text file right up front as "training data". For your purposes, you have to change the build_graph method and give it an option to set all that data (words, counts, words_per_epoch, current_epoch, total_words_processed, examples, labels, opts.vocab_words, opts.vocab_counts, opts.words_per_epoch) at initialization, rather than reading it from a text file.
Then you need to merge the two text files and load them once to produce the vocabulary. Then save all the data above, and use that to restore the network on each subsequent run.
If you use more than two texts, however, you need to include all the text you plan to use in the first run's data when producing the vocabulary.
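The vocabulary-fixing step described above can be sketched in plain Python (function name is mine): build one vocabulary from the merged corpora up front, so the embedding matrix keeps the same [vocab_size, 200] shape across every incremental run:

```python
from collections import Counter

def build_vocab(corpora, min_count=1):
    """Build one fixed vocabulary from all corpora combined, so every
    incremental training run sees the same [vocab_size, emb_dim] shapes."""
    counts = Counter(word for text in corpora for word in text.split())
    words = ["UNK"] + sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}

old_text = "the cat sat on the mat"
new_text = "the dog sat on the log"
vocab = build_vocab([old_text, new_text])
# len(vocab) is now fixed; both the initial run and the restore-and-continue
# run build an embedding matrix of shape [len(vocab), 200]
```

Words appearing only in data added later would map to "UNK", which is why all planned text has to be included when the vocabulary is first produced.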

TensorFlow shape checker

Unlike most programming languages, TensorFlow does not regard the shape of an array as part of its type. The downside of this is that, if you make a mistake and accidentally provide data of the wrong shape, it may silently give a wrong answer (e.g. "Slightly different shape converges to wrong number - why?"), which makes debugging difficult.
Does there exist a shape checker for TF? That is, a function or program that can analyze a graph (with sample feed_dict if need be) and raise the alarm if there is a shape mismatch?
TensorFlow does offer a shape-checking mechanism: the shape argument you can specify while declaring TensorFlow placeholders. By default (shape=None), TensorFlow accepts data of any shape. But if you do specify the shape while declaring your placeholders, it will raise a shape error whenever data of an incorrect or conflicting shape is fed. For example,
let's say I declared a placeholder named X and specified its shape argument too:
X=tf.placeholder(dtype=tf.float32, shape=[None,256])
This means that the number of rows of X can vary, but the number of features will always be 256. Now, if I mistakenly feed data with, say, 1000 rows and 20 features, a shape error will be raised.
Also, check the documentation: https://www.tensorflow.org/api_docs/python/tf/placeholder
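The check TensorFlow performs against a declared shape can be mimicked in plain NumPy to illustrate the rule: None matches any size, an integer must match exactly (the helper name is mine):

```python
import numpy as np

def check_shape(declared, data):
    """Mimic the placeholder shape check: None matches any size,
    an integer dimension must match exactly."""
    actual = np.asarray(data).shape
    if len(declared) != len(actual):
        raise ValueError(f"rank mismatch: declared {declared}, got {actual}")
    for want, got in zip(declared, actual):
        if want is not None and want != got:
            raise ValueError(f"shape mismatch: declared {declared}, got {actual}")

check_shape([None, 256], np.zeros((1000, 256)))    # OK: rows may vary
# check_shape([None, 256], np.zeros((1000, 20)))   # would raise ValueError
```

This is exactly why declaring shape=[None, 256] catches the 1000x20 mistake at feed time instead of letting it propagate.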
Use:
print(tf.shape(test_tensor))  # where test_tensor is whatever your tensor's name is
Documentation available here: https://www.tensorflow.org/api_docs/python/tf/shape

Multiple outputs per input in Tensorflow

Is it possible to get the semantics of an unbounded arc in Tensorflow without directly enqueuing in the op itself?
For example, if I want to write an operation that takes a scalar string my_string and "emits" tuples of ("string", num_occurrences_in_my_string), I have to resort to one of the following output options (as far as I know):
return the values necessary to construct a sparse Tensor
take a queue reference (of the correct type) and directly enqueue the input myself (like the tf.TextLineReader does)
As far as I can tell from the paper from Google on the Tensorflow "programming language", these are the only ways to accomplish it.
Is there a way in Tensorflow to emit an arbitrary number of output "rounds" for a given input, besides the aforementioned workarounds?
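The first workaround (returning the pieces from which a sparse or ragged result is built) can be sketched in plain Python: instead of "emitting" a variable number of tuples, the op returns parallel, variable-length outputs that downstream consumers index into (the function name is mine):

```python
from collections import Counter

def substring_counts(my_string):
    """Instead of emitting one (token, count) tuple per 'round', return
    two parallel variable-length outputs -- the same trick sparse tensors
    use (values + indices) to encode a variable amount of data densely."""
    counts = Counter(my_string.split())
    tokens = sorted(counts)
    return tokens, [counts[t] for t in tokens]

tokens, occurrences = substring_counts("to be or not to be")
# tokens -> ['be', 'not', 'or', 'to'], occurrences -> [2, 1, 1, 2]
```

The number of output elements varies per input, but the number of output tensors (here, two) stays fixed, which is what the dataflow graph requires.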