Loss tensor being pruned out of graph in PopART - ipu

I’ve written a very simple PopART program using the C++ interface, but every time I try to compile it to run on an IPU device I get the following error:
terminate called after throwing an instance of ‘popart::error’
what(): Could not find loss tensor ‘L1:0’ in main graph tensors
I’m defining the loss in my program like so:
auto loss = builder->aiGraphcoreOpset1().l1loss({outputs[0]}, 0.1f, popart::ReductionType::Sum, “l1LossVal”);
Is there something wrong with my loss definition that’s resulting in it being pruned out of the graph? I’ve followed the same structure as one of the Graphcore examples here.

This error usually happens when the model protobuf you pass to the TrainingSession or InferenceSession objects doesn’t contain the loss tensor. A common reason for this is when you call builder->getModelProto() before you add the loss tensor to the graph. To ensure your loss tensor is part of the protobuf your calls should be in the following order:
...
auto loss = builder->aiGraphcoreOpset1().l1loss(...);
auto proto = builder->getModelProto();
auto session = popart::TrainingSession::createFromOnnxModel(...);
...
The key point is that the getModelProto() call should be the last call from the builder interface before setting up the PopART session.

Related

Tensorflow Saving Error (from Tensorflow Example)

I am trying to use the Basic Text Classification example from Tensorflow on my own dataset. Training and verification have gone well and I am to the point in the tutorial for exporting the model. The model compiles and works on an array of strings.
After that, I'd like to save the model in h5 format for use in other projects. At this point, the tutorial refers you to save and load keras models tutorial.
This second tutorial essentially says to do this:
model.save('path/saved_model.h5')
This fails with
ValueError: Weights for model sequential_X have not yet been created. Weights are created when the Model is first called on inputs or build() is called with an input_shape.
So next I attempt to do this:
model.build((None, max_features))
model.save('path/saved_model.h5')
There are several errors with this:
ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor 'Placeholder:0' shape=(None, 45000) dtype=float32>
TypeError: Input 'input' of 'StringLower' Op has type float32 that does not match expected type of string.
ValueError: You cannot build your model by calling build if your layers do not support float type inputs. Instead, in order to instantiate and build your model, call your model on real tensor data (of the correct dtype).
I think this essentially means the input I defined to pass into model.build defaults to float and needs to be string. I think I have two options:
Somehow define my input layer to be string, which I cannot see how to do. This feels like the correct thing to do.
Use model.call. However I am not sure how to 'call my model on real tensor data' because tensors can't be strings and that is the input to the network.
I've seen one other person with this issue here, with no solution other than to rebuild the model in functional style with mixed results. I am not sure of the point of rebuilding in the functional style since I don't fully understand the problem.
I'd prefer to have the TextVectorization layer built into the final model to simplify deployment. This is exactly the reason the docs give for doing this in the example in the first place. (The model will save without it.)
I am a novice with this so I might be making a simple mistake. How can I get this model to save?

Outputting multiple loss components to tensorboard from tensorflow estimators

I am pretty new to tensorflow and I am struggling to get tensorboard to display some of my custom metrics. The model I am working with is a tf.estimator.Estimator, with an associated EstimatorSpec. The first new metric I am trying to log is from my loss function, which is composed of two components: a loss for an age prediction (tf.float32) and a loss for a class prediction (one-hot/multiclass), which I add together to determine a total loss (my model is predicting both a class and an age). The total loss is output just fine during training and shows up on tensorboard, but I would like to track the individual age and the class prediction loss components as well.
I think a solution that is supposed to work is to add a eval_metric_ops argument to the EstimatorSpec as described here (Custom eval_metric_ops in Estimator in Tensorflow). I have not been able to make this approach work, however. I defined a custom metric function that looks like this:
def age_loss_function(labels, ages_pred, ages_true):
per_sample_age_loss = get_age_loss_per_sample(ages_pred, ages_true) ### works fine
#### The error happens on this line:
mean_abs_age_diff, age_loss_update_fn = tf.metrics.Mean(per_sample_age_loss)
######
return mean_abs_age_diff, age_loss_update_fn
eval_metric_ops = {"age_loss": age_loss_function} #### Want to use this in EstimatorSpec
The instructions seem to say that I need both the error metric and the update function which should both be returned from the tf.metrics command as in examples like the one I linked. But this command fails for me with the error message:
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed in Graph execution. Use Eager execution or decorate this function with #tf.function.
I am probably just misusing the APIs. If someone can guide me on the proper usage I would really appreciate it. Thanks!
It looks like the problem was from a version change. I had updated to tensorflow 2.0 while the instructions I was following were from 1.X. Using tf.compat.v1.metrics.mean() instead gets past this problem.

Number of layers in the NN model keeps increasing every time I call a new instance of the model

I am working for a while now with tensorflow and jupyter but this is the first time I have encountered this problem. I have a NN model which has 6 layers and I get an instance of this model by calling the function "classifier"
def classifier(input_repr,prob,reuse=None):
e_l1=tf.layers.dense(inputs=input_repr,units=512,activation=tf.nn.leaky_relu)
e_l1=tf.nn.dropout(e_l1,prob)
e_l2=tf.layers.dense(inputs=e_l1,units=256,activation=tf.nn.leaky_relu)
e_l2=tf.nn.dropout(e_l2,prob)
e_l3=tf.layers.dense(inputs=e_l2,units=128,activation=tf.nn.leaky_relu)
e_l3=tf.nn.dropout(e_l3,prob)
e_l4=tf.layers.dense(inputs=e_l3,units=64,activation=tf.nn.leaky_relu)
e_l4=tf.nn.dropout(e_l4,prob)
e_l5=tf.layers.dense(inputs=e_l4,units=32,activation=tf.nn.leaky_relu)
e_l5=tf.nn.dropout(e_l5,prob)
d_l3=tf.layers.dense(inputs=e_l5,units=1,activation=tf.nn.leaky_relu)
return d_l3
I also have a function to visualize the model summary as
def model_summary():
model_vars = tf.trainable_variables()
slim.model_analyzer.analyze_vars(model_vars, print_info=True)
print(model_summary())
And i get the model instance as,
model_output=classifier(input_repr,prob)
The problem is whenever I call this, and then i call model_summary() the layers gets stacked up to the previous model. For eg, if when i call the "classifier()" for the first time, model_Summary() shows 5 layers, but when i call it again it shows 10 layers and so on. I always initialize again before calling the classifier() method but it just happens again and again. I don't know if that is some issue with jupyter. The only way I know to solve this problem is to completely restart the kernel, which leads to loss of variables.
Don't forget to reset the default graph tf.reset_default_graph() before creating your model. The issue is that the notebook is running in a single thread, and Tensorflow stacks new nodes on the graph whenever you build the graph over and over again. That's why when prototyping in Jupyter notebook, always reset the default graph when you start build a new graph.
every time you call function classifier you create additional layers, when you will create your model and compile it only use model object to model.fit and model.predict

Using cleverhans with just model weights and no model class

I am using a pretrained model that someone else has created, they have only released the model weights. Currently I am importing the model weights into my graph and calling them by the tensor names. However, this seems incompatible with cleverhans' code that seems to require a model object which has the method predict.
Is there any work around for this which does not require me to rewrite most of the cleverhans attacks because I do not have the model class and predict method?
What you are describing should be possible but may be somewhat intensive on resources because it may recreate the graph several times. Basically, you can implement a CleverHans model class that takes in a graph checkpoint in the init method. The get_logits or fprop method should take an input tensor and load the graph to obtain the corresponding output tensor by performing some graph surgery to replace the checkpoint graph's input tensor with your own tensor: see the input_map argument in `tf.import_graph_de: https://www.tensorflow.org/api_docs/python/tf/graph_util/import_graph_def

How do you create a dynamic_rnn with dynamic "zero_state" (Fails with Inference)

I have been working with the "dynamic_rnn" to create a model.
The model is based upon a 80 time period signal, and I want to zero the "initial_state" before each run so I have setup the following code fragment to accomplish this:
state = cell_L1.zero_state(self.BatchSize,Xinputs.dtype)
outputs, outState = rnn.dynamic_rnn(cell_L1,Xinputs,initial_state=state, dtype=tf.float32)
This works great for the training process. The problem is once I go to the inference, where my BatchSize = 1, I get an error back as the rnn "state" doesn't match the new Xinputs shape. So what I figured is I need to make "self.BatchSize" based upon the input batch size rather than hard code it. I tried many different approaches, and none of them have worked. I would rather not pass a bunch of zeros through the feed_dict as it is a constant based upon the batch size.
Here are some of my attempts. They all generally fail since the input size is unknown upon building the graph:
state = cell_L1.zero_state(Xinputs.get_shape()[0],Xinputs.dtype)
.....
state = tf.zeros([Xinputs.get_shape()[0], self.state_size], Xinputs.dtype, name="RnnInitializer")
Another approach, thinking the initializer might not get called until run-time, but still failed at graph build:
init = lambda shape, dtype: np.zeros(*shape)
state = tf.get_variable("state", shape=[Xinputs.get_shape()[0], self.state_size],initializer=init)
Is there a way to get this constant initial state to be created dynamically or do I need to reset it through the feed_dict with tensor-serving code? Is there a clever way to do this only once within the graph maybe with an tf.Variable.assign?
The solution to the problem was how to obtain the "batch_size" such that the variable is not hard coded.
This was the correct approach from the given example:
Xinputs = tf.placeholder(tf.int32, (None, self.sequence_size, self.num_params), name="input")
state = cell_L1.zero_state(Xinputs.get_shape()[0],Xinputs.dtype)
The problem is the use of "get_shape()[0]", this returns the "shape" of the tensor and takes the batch_size value at [0]. The documentation doesn't seem to be that clear, but this appears to be a constant value so when you load the graph into an inference, this value is still hard coded (maybe only evaluated at graph creation?).
Using the "tf.shape()" function, seems to do the trick. This doesn't return the shape, but a tensor. So this seems to be updated more at run-time. Using this code fragment solved the problem of a training batch of 128 and then loading the graph into TensorFlow-Service inference handling a batch of just 1.
Xinputs = tf.placeholder(tf.int32, (None, self.sequence_size, self.num_params), name="input")
batch_size = tf.shape(Xinputs)[0]
state = self.cell_L1.zero_state(batch_size,Xinputs.dtype)
Here is a good link to TensorFlow FAQ which describes this approach 'How do I build a graph that works with variable batch sizes?':
https://www.tensorflow.org/resources/faq