How to use feature_column v2 in Tensorflow (TF-Ranking) - tensorflow

I'm using TF-Ranking to train a recommendation engine. I have encountered a problem that seems to be a version incompatibility issue concerning tf.feature_column API.
The short version of my question is: what is a v2 feature column (TF 2.0?) (see this for instance), and how can I ensure that my feature columns are treated as v2 while I'm still using TF 1.14?
Here are the details:
I'm unable to shorten my code sufficiently to provide a reproducible example. But I will try to describe the problem in words.
TF Version: 1.14
OS: Ubuntu 18.04
I initially had two features in my model, user and item, both sparse categorical features, each wrapped in its own tf.feature_column.embedding_column. I was able to use the train_and_evaluate method of the Estimator and export the model for serving.
Then I added a new feature curr_item which is only present during prediction (as a context feature). This shares the embeddings with item. So now I have a tf.feature_column.shared_embedding_columns which wraps both item and current_item.
Now calling train_and_evaluate results in the following error (shortened messages):
ValueError: Could not load all requested variables from checkpoint. Please make sure your model_fn does not expect variables that were not saved in the checkpoint.
Key input_layer/user_embedding/embedding_weights not found in checkpoint
Note that calling only the train method works fine. My understanding is that once it gets to evaluation, it tries to load the variables from the checkpoint, but that variable doesn't exist. I did a little debugging and found the reason:
When encode_listwise_features is called during training (which in turn calls encode_features) all features (user and item) are "V2" (not sure what that means) and so the following if statement holds:
https://github.com/tensorflow/ranking/blob/31fc134816cc4974a46a11e7bb2df0066d0a88f0/tensorflow_ranking/python/feature.py#L92
and both variables are named with an encoding_layer prefix (scope name?):
encoding_layer/user_embedding/embedding_weights
encoding_layer/item_embedding/embedding_weights
But when I call the same function for all three features (I'm a little confused whether this happens in eval or predict mode), some of these are not "V2" and we end up in the else branch of the above condition, which calls input_layer directly, so the variables are named with an input_layer prefix. Now TF is trying to restore
input_layer/user_embedding/embedding_weights
from the checkpoint, but that name doesn't exist in the checkpoint, because it was called
encoding_layer/user_embedding/embedding_weights
in training.
So:
1) How can I ensure that all my features are treated as v2 at all stages? I tried using tf.compat.v2.feature_column but that didn't help. There is already a ToDo note above that if statement for this.
2) Can encode_features be modified to avoid this situation, e.g. by raising an exception with a helpful message?
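The prefix mismatch described above can be checked mechanically before calling train_and_evaluate. The sketch below is plain Python and makes no TensorFlow calls. It assumes you have already collected the variable names stored in the checkpoint (for example with tf.train.list_variables) and the names the evaluation graph asks for. The helper name diagnose_missing_variables is hypothetical, and the sample names are taken from the question and its error message.

```python
def diagnose_missing_variables(checkpoint_names, graph_names):
    """Return (missing, suggestions) for variables the graph wants but the checkpoint lacks.

    A suggestion pairs a missing graph variable with checkpoint variables that
    have the same name once the leading scope (text before the first '/') is dropped.
    """
    checkpoint_names = set(checkpoint_names)
    # Index checkpoint variables by their name without the top-level scope.
    by_suffix = {}
    for name in checkpoint_names:
        suffix = name.split("/", 1)[-1]
        by_suffix.setdefault(suffix, []).append(name)

    missing = sorted(n for n in graph_names if n not in checkpoint_names)
    suggestions = {}
    for name in missing:
        suffix = name.split("/", 1)[-1]
        candidates = sorted(by_suffix.get(suffix, []))
        if candidates:
            suggestions[name] = candidates
    return missing, suggestions


# Names saved during training (from the question).
saved = [
    "encoding_layer/user_embedding/embedding_weights",
    "encoding_layer/item_embedding/embedding_weights",
]
# Name requested at evaluation time (from the error message).
wanted = ["input_layer/user_embedding/embedding_weights"]

missing, suggestions = diagnose_missing_variables(saved, wanted)
for name in missing:
    print(name, "->", suggestions.get(name, "no candidate"))
```

A non-empty suggestions mapping means the weights exist under a different scope, which points at the two code paths in encode_features rather than at a feature that was never trained.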

Related

create_training_graph() failed when converted MobileFacenet to quantize-aware model with TF-lite

I am trying to quantize MobileFacenet (code from sirius-ai) according to the suggestion
and I think I met the same issue as this one
When I add tf.contrib.quantize.create_training_graph() into training graph
(train_nets.py ln.187: before train_op = train(...) or in train() utils/common.py ln.38 before gradients)
It did not add quantize-aware ops into the graph to collect the dynamic range (max/min).
I assumed that I would see some additional nodes in TensorBoard, but I did not, so I think the quantize-aware ops were not added to the training graph.
When I traced through the TensorFlow code, I found that _FindLayersToQuantize() returned nothing.
However, when I add tf.contrib.quantize.create_eval_graph() to refine the training graph, I can see some quantize-aware ops such as act_quant...
Since I did not add the ops to the training graph successfully, I have no weights to load in the eval graph.
So I get error messages such as
Key MobileFaceNet/Logits/LinearConv1x1/act_quant/max not found in checkpoint
or
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value MobileFaceNet/Logits/LinearConv1x1/act_quant/max
Does anyone know how to fix this error? or how to get quantized MobileFacenet with good accuracy?
Thanks!
Hi,
Unfortunately, the contrib/quantize tool is now deprecated. It won't be able to support newer models, and we are not working on it anymore.
If you are interested in QAT, I would recommend trying the new TF/Keras QAT API. We are actively developing that and providing support for it.

Can't save/export and load a keras model that uses eager execution

I'm following the RNN text-generation tutorial with eager execution pretty much line for line. I've trained the model with my own data set and have saved a low loss checkpoint. I'm able to load the weights and generate text but I want to export/save the model so that I can learn how to deploy one using flask. However I can't figure out how. The version I'm using is '1.14.0-rc1'.
The tutorial: https://www.tensorflow.org/tutorials/sequences/text_generation
I have been able to save the model as an HDF5 file but I cannot load it. I've also disabled eager execution but that causes problems with running the code later on. I have tried the following and a few more snippets but those led to nothing as well:
new_model = keras.models.load_model("/content/gdrive/My Drive/ColabNotebooks/ckpt4/my_model.h5")
However, I get
RuntimeError: tf.placeholder() is not compatible with eager execution.
Lastly I found this in another post and tried it as well but was met with another error:
tf.saved_model.save(model, "/content/gdrive/My Drive/Colab Notebooks/ckpt4/my_model.h5")
error:
AssertionError: Tried to export a function which references untracked object Tensor("StatefulPartitionedCall/args_2:0", shape=(), dtype=resource).TensorFlow objects (e.g. tf.Variable) captured by functions must be tracked by assigning them to an attribute of a tracked object or assigned to an attribute of the main object directly.

Using saved model for prediction in tensorflow

I use this code to restore my model, but I don't know how to predict after restoring it. Which function can I use? I'm a beginner in TensorFlow and I have no idea which parameters or functions are saved.
In the meta model:
sess = tf.Session()
saver = tf.train.import_meta_graph("/home/MachineLearning/model.ckpt.meta")
saver.restore(sess,tf.train.latest_checkpoint('./'))
print("Model restored with success ")
x_predict,y_predict= load_svmlight_file('/MachineLearning/to_predict.csv')
x_predict = x_valid.toarray()
sess.run([] ,feed_dict ) #i don't know how to use predict function
These are the results:
$python predict.py
Model restored with success
Traceback (most recent call last):
File "predict.py", line 23, in <module>
sess.run([] ,feed_dict )
NameError: name 'feed_dict' is not defined
You're almost there. Tensorflow is simply a math library. Your graph is a collection of math operations with the associated dependencies (i.e. a graph, specifically a DAG).
When you loaded the graph and associated variables (weights) you loaded all the definitions. Now you need to ask tensorflow to compute some value in the graph. There are lots of values it could compute, the one you want is often named logits (a typical name for the output layer of a neural network). But note that it could be named anything (especially if this isn't a neural network model), you need to understand the model. You might also want to compute an operation named accuracy which is defined to compute the accuracy of a particular batch of inputs (again depends on your model).
Note that you will need to provide tensorflow with whatever it needs to perform these computations. There is generally a placeholder where you pass in your data (and during training a placeholder for your labels which you don't need for prediction because none of the operations you will ask tensorflow to compute depend on it).
But you will need to get references to these various operations (logits, and accuracy) and placeholders (x is a typical name). Since you loaded your graph from disk you don't have the references (note that an alternative way of loading the model is to re-run the code that builds the model, which gives you easy access to the references you need).
In order to get the right references you can look them up by name. Here's how you would get a list of all the operations:
List of tensor names in graph in Tensorflow
Then to get a specific OP (operation) by name:
How to get a tensorflow op by name?
So you'll have something like this:
logits = tf.get_default_graph().get_tensor_by_name("logits:0")
x = tf.get_default_graph().get_tensor_by_name("x:0")
accuracy = tf.get_default_graph().get_tensor_by_name("accuracy:0")
Note that the :0 suffix is an output index: a tensor is named after the operation that produces it, followed by which of that operation's outputs it is. get_operation_by_name expects the name without the suffix, while get_tensor_by_name expects it with the suffix. Now you have all the references you need and you can use sess.run to perform a specific computation, providing the input data and the tensors you'd like to have computed:
sess.run([logits, accuracy], feed_dict={x:your_input_data_in_numpy_format})
The names of these elements will vary in your implementation, I've used the most common names. If they weren't given pretty names it'll be hard to identify them and you'll need to look through the original code that produced the graph. In fact if they weren't named properly looking them up by name is so painful that it's probably better to just re-run the code that produced the original graph rather than import the meta graph. Notice that saver.restore only restores the actual data, import_meta_graph is the optional piece which can be replaced by simply re-building the graph programmatically.
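The difference between an operation name and a tensor name is easy to get wrong, so here is a small sketch of the naming convention in plain Python. It assumes the usual "op_name:output_index" form. split_tensor_name is a hypothetical helper, not a TensorFlow function, and the names are the same illustrative ones used above.

```python
def split_tensor_name(name):
    """Split 'op_name:index' into (op_name, index); a bare op name maps to output 0."""
    op_name, sep, index = name.rpartition(":")
    if not sep:
        # No ':' present, so the whole string is the operation name.
        return name, 0
    if not index.isdigit():
        raise ValueError("output index must be a non-negative integer: %r" % name)
    return op_name, int(index)


for tensor_name in ["logits:0", "x:0", "accuracy"]:
    op_name, index = split_tensor_name(tensor_name)
    print("tensor %-10s -> op %-8s output %d" % (tensor_name, op_name, index))
```

The first element is the kind of name get_operation_by_name expects, while the full string with the index is what get_tensor_by_name expects.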

how to properly train TensorFlow on one machine and evaluate on another?

I'm training a TensorFlow (1.2) model on one machine and attempting to evaluate it on another. Everything works fine when I stay local to one machine.
I am not using placeholders and feed-dict's to get data to the model but rather TF file queues and batch generators. I suspect with placeholders this would be much easier but I am trying to make the TF batch generator machinery work.
In my evaluation code I have lines like:
saver = tf.train.Saver()
ckpt = tf.train.get_checkpoint_state(os.path.dirname(ckpt_dir))
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(sess, ckpt.model_checkpoint_path)
This produces errors like:
2017-08-16 12:29:06.387435: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /data/perdue/minerva/tensorflow/models/11/20170816/checkpoints-20: Not found: /data/perdue/minerva/tensorflow/models/11/20170816
The referenced directory (/data/...) exists on my training machine but not the evaluation machine. I have tried things like
saver = tf.train.import_meta_graph(
    '/local-path/checkpoints-XXX.meta',
    clear_devices=True
)
saver.restore(
    sess, '/local-path/checkpoints-XXX',
)
but this produces a different error:
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value train_file_queue/limit_epochs/epochs
or, if I explicitly call the initializer functions immediately after the restore,
AttributeError: 'Tensor' object has no attribute 'initializer'
Here, train_file_queue/limit_epochs/epochs is an element of the training graph that I would like the evaluation function to ignore (I have another, new element test_file_queue that is pointing at a different file queue with the evaluation data files in it).
I think in the second case, when I'm calling the initializers right after the restore, there is something in the local variables that doesn't work quite like a "normal" Tensor, but I'm not sure exactly what the issue is.
If I just use a generic Saver and restore TF does the right thing on the original machine - it just restores model parameters and then uses my new file queue for evaluation. But I can't be restricted to that machine, I need to be able to evaluate the model on other machines.
I've also tried freezing a protobuf and a few other options and there are always difficulties associated with the fact that I need to use file queues as the most upstream inputs.
What is the proper way to train using TensorFlow's file queues and batch generators and then deploy the model on a different machine / in a different environment? I suspect if I were using feed-dict's to get data to the graph this would be fairly simple, but it isn't as clear when using the built in file queues and batch generators.
Thanks for any comments or suggestions!
At least part of this dilemma was resolved in TF 1.2 or 1.3. There is a new flag for the Saver() constructor:
saver = tf.train.Saver(save_relative_paths=True)
that makes it such that when you save the checkpoint directory and move it to another machine, and use it to restore() a model, everything works without errors relating to nonexistent paths for the data (the paths from the old machine where training was performed).
It isn't clear my use of the API is really idiomatic in this case, but at least the code works such that I can export trained models from one machine to another.
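For checkpoints that were already written with absolute paths, one possible workaround is to rewrite the small text file named checkpoint that sits next to the checkpoint data. The sketch below is plain Python and assumes that file uses the usual key: "path" lines (model_checkpoint_path and all_model_checkpoint_paths). relativize_checkpoint_state is a hypothetical helper, and the sample content reuses the path from the error message above.

```python
import os
import re

# Matches lines such as: model_checkpoint_path: "/abs/dir/checkpoints-20"
_PATH_LINE = re.compile(r'^(\s*\w+:\s*")([^"]*)(".*)$')


def relativize_checkpoint_state(text):
    """Replace every quoted path in a checkpoint state file with its basename."""
    out = []
    for line in text.splitlines():
        match = _PATH_LINE.match(line)
        if match:
            prefix, path, suffix = match.groups()
            line = prefix + os.path.basename(path) + suffix
        out.append(line)
    return "\n".join(out) + "\n"


state = (
    'model_checkpoint_path: "/data/perdue/minerva/tensorflow/models/11/20170816/checkpoints-20"\n'
    'all_model_checkpoint_paths: "/data/perdue/minerva/tensorflow/models/11/20170816/checkpoints-20"\n'
)
print(relativize_checkpoint_state(state))
```

Relative entries should then be resolved against the directory passed to tf.train.get_checkpoint_state, so the checkpoint directory can be copied to another machine as a unit. Keep a backup of the original file before overwriting it.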

How to properly freeze a tensorflow graph containing a LookupTable

I am working with a model that uses multiple lookup tables to transform the model input from text to feature ids. I am able to train the model fine. I am able to load it via the javacpp bindings. I am using a default Saver object via the tensor flow supervisor on a periodic basis.
When I try to run the model I get the following error:
Table not initialized.
[[Node: hash_table_Lookup_3 = LookupTableFind[Tin=DT_STRING, Tout=DT_INT64,
_class=["loc:@string_to_index_2/hash_table"], _output_shapes=[[-1]],
_device="/job:localhost/replica:0/task:0/cpu:0"]
(string_to_index_2/hash_table, ParseExample/ParseExample:5, string_to_index_2/hash_table/Const)]]
I prepare the model by using the freeze_graph.py script as follows:
bazel-bin/tensorflow/python/tools/freeze_graph --input_graph=/tmp/tf/graph.pbtxt \
--input_checkpoint=/tmp/tf/model.ckpt-0 --output_graph=/tmp/ticker_classifier.pb \
--output_node_names=sigmoid --initializer_nodes=init_all_tables
As far as I can tell, specifying initializer_nodes has no effect on the resulting file. Am I running into something that is not currently supported? If not, then is there something else I need to do to prepare the graph to be frozen?
I had the same problem when using C++ to invoke the TF API to run inference. The cause seems to be that I trained the model using tf.feature_column.categorical_column_with_hash_bucket, which needs to be initialized like this:
table_init_op = tf.tables_initializer(name="init_all_tables")
sess.run(table_init_op)
So when you want to freeze the model, you must append the name of table_init_op to the argument "--output_node_names":
freeze_graph --input_graph=/tmp/tf/graph.pbtxt \
--input_checkpoint=/tmp/tf/model.ckpt-0 \
--output_graph=/tmp/ticker_classifier.pb \
--output_node_names=sigmoid,init_all_tables \
--initializer_nodes=init_all_tables
When you load and initialize the model in C++, you should first invoke the TF C++ API like this:
std::vector<Tensor> dummy_outputs;
Status st = session->Run({}, {}, {"init_all_tables"}, dummy_outputs);
Now all tables are initialized and you can do other things such as inference. This issue may also help.