tensorflow eager and imperative custom layers - tensorflow

In one deep learning notes (Stanford cs20si), I once saw the following statement regarding eager. I don't quite understand what does the imperative custom layers indicate, and how to understand this code example in the context of imperative custom layers?

Normally, using tensorflow you are not able to access the content of a tensor directly. This means, that you are not able to use if-statements. Instead, you have to construct both possible branches of the branch and then use tf.conditional to include a node which switches between these two, depending on the content of a tensor. This makes it sometimes hard to implement imperative commands in layers.
The example you posted above show, that you are now (with eager execution) able to access the content of tensors, which means, that you can write all the if-statements, for-loops and so on, directly in python and you do not have to construct a huge graph on your own for each possibility. As the code inside the layer is now executed just like a normal imperative programming language, you can call this kind of layer an imperative layer - this is identical to the motivation behind PyTorch.

Related

What is the difference in purpose between tf.py_function and tf.function?

The difference between the two is muddled in my head, notwithstanding the nuances of what is eager and what isn't. From what I gather, the #tf.function decorator has two benefits in that
it converts functions into TensorFlow graphs for performance, and
allows for a more Pythonic style of coding by interpreting many (but not all) common-place Python operations into tensor operations, e.g. if into tf.cond, etc.
From the definition of tf.py_function, it seems that it does just #2 above. Hence, why bother with tf.py_function when tf.function does the job with a performance improvement to boot and without the inability of the former to serialize?
They do indeed start to resemble each other as they are improved, so it is useful to see where they come from. Initially, the difference was that:
#tf.function turns python code into a series of TensorFlow graph nodes.
tf.py_function wraps an existing python function into a single graph node.
This means that tf.function requires your code to be relatively simple while tf.py_function can handle any python code, no matter how complex.
While this line is indeed blurring, with tf.py_function doing more interpretation and tf.function accepting lot's of complex python commands, the general rule stays the same:
If you have relatively simple logic in your python code, use tf.function.
When you use complex code, like large external libraries (e.g. connecting to a database, or loading a large external NLP package) use tf.py_function.

TFF: Customizing the model implementation

Please can anyone explain to me this part of tuto:
here is the link https://www.tensorflow.org/federated/tutorials/federated_learning_for_image_classification
Part:
Customizing the model implementation
Keras is the recommended high-level model API for TensorFlow, and we encourage using Keras models (via tff.learning.from_keras_model or tff.learning.from_compiled_keras_model) in TFF whenever possible.
However, tff.learning provides a lower-level model interface, tff.learning.Model, that exposes the minimal functionality necessary for using a model for federated learning. Directly implementing this interface (possibly still using building blocks like tf.keras.layers) allows for maximum customization without modifying the internals of the federated learning algorithms.
So let's do it all over again from scratch.
Defining model variables, forward pass, and metrics
The first step is to identify the TensorFlow variables we're going to work with. In order to make the following code more legible, let's define a data structure to represent the entire set. This will include variables such as weights and bias that we will train, as well as variables that will hold various cumulative statistics and counters we will update during training, such as loss_sum, accuracy_sum, and num_examples.
MnistVariables = collections.namedtuple(
'MnistVariables', 'weights bias num_examples loss_sum accuracy_sum')
Roughly analogous to the multiple paths Keras exposes to create a Keras model, TFF exposes multiple distinct ways of creating a tff.learning.Model. One of them is through the constructor functions, tff.learning.from_keras_model or tff.learning.from_compiled_keras_model, but each of these functions constructs and returns an instance of the abstract base class tff.learning.Model; the purpose of this section of the tutorial is to show that it is possible to instead directly construct such an instance by implementing the appropriate methods in the abstract interface.
If it is the collections.namedtuple MnistVariables you are asking about, it is simply a data container class introduced for convenience, to help group the tf.Variables which will roughly be used by the TFF runtime to track state during training. One important thing to note from the tff.learning.Model documentation, evidenced by this tutorial, is the line:
All tf.Variables should be introduced in __init__
If you are familiar with TensorFlow Variables, you will understand that controlling their instantiation is quite important.

How to use tf.layers classes instead of functions

It seems that tf.Layer modules come in two flavours: functions and classes. I normally use the functions directly (e.g, tf.layers.dense) but I'd like to know how to use classes directly (tf.layers.Dense). I've started experimenting with the new eager execution mode in tensorflow and I think using classes are going to be useful there as well but I haven't seen good examples in the documentation. Is there any part of TF documentation that shows how these are used?
I guess it would make sense to use them in a class where these layers are instantiated in the __init__ and then they're linked in the __call__ method when the inputs and dimensions are known?
Are these tf.layer classes related to tf.keras.Model? Is there an equivalent wrapper class for using tf.layers?
Update: for eager execution there's tfe.Network that must be inherited. There's an example here
tf.layers and tf.keras.layer classes are generally interchangeable and in fact at head (and thus by the next release - 1.9), the former actually inherits from the latter.
TensorFlow is moving towards consolidating on tf.keras APIs for constructing models as that makes state ownership more explicit (e.g., parameters are "owned" by the Layer object, as opposed to the functional style where all model parameters are put in a "collection" associated with the complete graph). This style works well for both eager execution and graph construction (support for eager execution is improving with every release). I'd recommend using tf.keras.layers and tf.keras.Model.
Some examples that you may find useful:
MNIST in the tensorflow/models repository
The programmer's guide
Other eager execution samples (where the exact same model definition works for both graph execution and eager execution).
Not all existing TensorFlow examples have been moved to this style, but they slowly will.
Hope that helps.

What is the difference of static Computational Graphs in tensorflow and dynamic Computational Graphs in Pytorch?

When I was learning tensorflow, one basic concept of tensorflow was computational graphs, and the graphs was said to be static.
And I found in Pytorch, the graphs was said to be dynamic.
What's the difference of static Computational Graphs in tensorflow and dynamic Computational Graphs in Pytorch?
Both frameworks operate on tensors and view any model as a directed acyclic graph (DAG), but they differ drastically on how you can define them.
TensorFlow follows ‘data as code and code is data’ idiom. In TensorFlow you define graph statically before a model can run. All communication with outer world is performed via tf.Session object and tf.Placeholder which are tensors that will be substituted by external data at runtime.
In PyTorch things are way more imperative and dynamic: you can define, change and execute nodes as you go, no special session interfaces or placeholders. Overall, the framework is more tightly integrated with Python language and feels more native most of the times. When you write in TensorFlow sometimes you feel that your model is behind a brick wall with several tiny holes to communicate over. Anyways, this still sounds like a matter of taste more or less.
However, those approaches differ not only in a software engineering perspective: there are several dynamic neural network architectures that can benefit from the dynamic approach. Recall RNNs: with static graphs, the input sequence length will stay constant. This means that if you develop a sentiment analysis model for English sentences you must fix the sentence length to some maximum value and pad all smaller sequences with zeros. Not too convenient, huh. And you will get more problems in the domain of recursive RNNs and tree-RNNs. Currently Tensorflow has limited support for dynamic inputs via Tensorflow Fold. PyTorch has it by-default.
Reference:
https://medium.com/towards-data-science/pytorch-vs-tensorflow-spotting-the-difference-25c75777377b
https://www.reddit.com/r/MachineLearning/comments/5w3q74/d_so_pytorch_vs_tensorflow_whats_the_verdict_on/
Both TensorFlow and PyTorch allow specifying new computations at any point in time. However, TensorFlow has a "compilation" steps which incurs performance penalty every time you modify the graph. So TensorFlow optimal performance is achieved when you specify the computation once, and then flow new data through the same sequence of computations.
It's similar to interpreters vs. compilers -- the compilation step makes things faster, but also discourages people from modifying the program too often.
To make things concrete, when you modify the graph in TensorFlow (by appending new computations using regular API, or removing some computation using tf.contrib.graph_editor), this line is triggered in session.py. It will serialize the graph, and then the underlying runtime will rerun some optimizations which can take extra time, perhaps 200usec. In contrast, running an op in previously defined graph, or in numpy/PyTorch can be as low as 1 usec.
In tensorflow you first have to define the graph, then you execute it.
Once defined you graph is immutable: you can't add/remove nodes at runtime.
In pytorch, instead, you can change the structure of the graph at runtime: you can thus add/remove nodes at runtime, dynamically changing its structure.

skipping layer in backpropagation in keras

I am using Keras with tensorflow backend and I am curious whether it is possible to skip a layer during backpropagation but have it execute in the forward pass. So here is what I mean
Lambda (lambda x: a(x))
I want to apply a to x in the forward pass but I do not want a to be included in the derivation when the backprop takes place.
I was trying to find a solution bit I could not find anything. Can somebody help me out here?
UPDATE 2
In addition to tf.py_func, there is now an official guide on how to add a custom op.
UPDATE
See this question for an example of writing a custom op with gradient purely in Python without needing to rebuild anything. Note that there are some limitations to the method (see the documentation of tf.py_func).
Not exactly a solution to the problem, but still kind of an answer and too long for comments.
That's not even a Keras issue, but a TensorFlow one. Each op defines its own gradient computation that is used during backpropagation. I you really wanted to something like that, you would need to implement the op into TensorFlow yourself (no easy feat) and define the gradient that you want - because you can't have "no gradient", if anything it would be 1 or 0 (otherwise you can't go on with backpropagation). There is a tf.NoGradient function in TensorFlow which causes an op to propagate zeros, but I don't think it is meant to / can be used out of TensorFlow own internals.
UPDATE
Okay so a bit more of context. TensorFlow graphs are built of ops, which are implemented by kernels; this is basically a 1-to-1 mapping, except that there may be for example a CPU and a GPU kernel for an op, hence the differentiation. The set of ops supported by TensorFlow is usually static, I mean it can change with newer versions, but in principle you cannot add your own ops, because the ops of a graph go into the Protobuf serialized format, so if you made your own ops then you would not be able to share your graph. Ops are then defined at C++ level with the macro REGISTER_OP (see for example here), and kernels with REGISTER_KERNEL_BUILDER (see for example here).
Now, where do gradients come into play? Well, the funny thing is that the gradient of an op is not defined at C++ level; there are ops (and kernels) that implement the gradient of other ops (if you look at the previous files you'll find ops/kernels with the name ending in Grad), but (as far as I'm aware) these are not explicitly "linked" at this level. It seems that the associations between ops and their gradients is defined in Python, usually via tf.RegisterGradient or the aforementioned tf.NoGradient (see for example here, Python modules starting with gen_ are autogenerated with the help of the C++ macros); these registrations inform the backpropagation algorithm about how to compute the gradient of the graph.
So, how to actually work this out? Well, you need to create at least one op in C++ with the corresponding kernel/s implementing the computation that you want for your forward pass. Then, if the gradient computation that you want to use can be expressed with existing TensorFlow ops (which is most likely), you would just need to call tf.RegisterGradient in Python and do the computation there in "standard" TensorFlow. This is quite complicated, but the good news is it's possible, and there's even an example for it (although I think they kinda forgot the gradient registration part in that one)! As you will see, the process involves compiling the new op code into a library (btw I'm not sure if any of this may work on Windows) that is then loaded from Python (obviously this involves going through the painful process of manual compilation of TensorFlow with Bazel). A possibly more realistic example can be found in TensorFlow Fold, an extension of TensorFlow for structured data that register (as of one) one custom operation here through a macro defined here that calls REGISTER_OP, and then in Python it loads the library and register its gradient here through their own registration function defined here that simply calls tf.NotDifferentiable (another name for tf.NoGradient)
tldr: It is rather hard, but it can be done and there are even a couple of examples out there.
As mentioned in #jdehesa's comments. You can implement your function with an "alternative gradient". Forgive me if my math is not correct, but I think a derivative returning "1" would be the correct way to have no effect on the backpropagation while still passing the learning through. For how to construct it, see here. The example I cited goes further and allows you to construct an activation function from a python function. So in place of the spiky function, substitute your function a, and in place of his derivative d_spiky replace it with
def constant(x):
return 1
So on the forward pass, a is applied in the layer and the the backwards pass 1 is applied which should simply pass the weight adjustments through.
You can then just create an Activation layer in Keras using this function.