Microsoft CNTK Automatic Differentiation

According to Microsoft, CNTK includes automatic differentiation. To better understand the source (which I've successfully built), I'd like to know: which C++ classes implement AD, and how is it implemented in CNTK?

The CNTK class Function implements AD (via its Gradients method, to be precise). Neural networks are represented as compositions of Functions, such as g(f(x)). The derivative of g with respect to f (and, by the chain rule, with respect to the leaf inputs) is then computed by a backward pass over that composition.
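The answer stops short of a concrete snippet, so here is an illustration only, written against CNTK's Python bindings (which wrap the same C++ Function class; the C++ entry point is the Gradients method mentioned above). Treat it as a sketch of the composition-plus-backward-pass idea, not as the internal implementation:

import numpy as np
import cntk as C

x = C.input_variable(3)              # leaf Variable
f = C.sigmoid(x)                     # inner Function f(x)
g = C.reduce_sum(C.square(f))        # outer Function g(f(x)), scalar-valued

# Function.grad runs the backward pass over the composed graph; the chain rule
# dg/dx = dg/df * df/dx is applied automatically.
dg_dx = g.grad({x: np.array([[0.1, 0.2, 0.3]], dtype=np.float32)})
print(dg_dx)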

Related

Applying FedSGD in the TensorFlow Federated framework

In federated learning, the state-of-the-art and best-known method is the federated averaging algorithm (FedAvg), which can easily be applied in TFF using the function tff.learning.build_federated_averaging_process.
I want to understand how it differs from the function tff.learning.algorithms.build_fed_sgd, and whether there is any other aggregation method I can apply using TFF, such as FedSGD.
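A minimal sketch of how the two builders are typically invoked (the exact keyword arguments are assumptions and may differ between TFF versions; model_fn is a placeholder you would define as in the TFF tutorials):

import tensorflow as tf
import tensorflow_federated as tff

def model_fn():
    # placeholder: return a tff.learning.Model, e.g. via tff.learning.from_keras_model
    ...

# FedAvg: each client runs several local optimizer steps, then the server averages the models.
fed_avg = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02))

# FedSGD: each client computes a single (full-batch) gradient; the server aggregates
# those gradients and applies them with its own optimizer, so no client optimizer is passed.
fed_sgd = tff.learning.algorithms.build_fed_sgd(model_fn)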

TFF: Customizing the model implementation

Please, can anyone explain this part of the tutorial to me:
here is the link https://www.tensorflow.org/federated/tutorials/federated_learning_for_image_classification
Part:
Customizing the model implementation
Keras is the recommended high-level model API for TensorFlow, and we encourage using Keras models (via tff.learning.from_keras_model or tff.learning.from_compiled_keras_model) in TFF whenever possible.
However, tff.learning provides a lower-level model interface, tff.learning.Model, that exposes the minimal functionality necessary for using a model for federated learning. Directly implementing this interface (possibly still using building blocks like tf.keras.layers) allows for maximum customization without modifying the internals of the federated learning algorithms.
So let's do it all over again from scratch.
Defining model variables, forward pass, and metrics
The first step is to identify the TensorFlow variables we're going to work with. In order to make the following code more legible, let's define a data structure to represent the entire set. This will include variables such as weights and bias that we will train, as well as variables that will hold various cumulative statistics and counters we will update during training, such as loss_sum, accuracy_sum, and num_examples.
import collections

MnistVariables = collections.namedtuple(
    'MnistVariables', 'weights bias num_examples loss_sum accuracy_sum')
Roughly analogous to the multiple paths Keras exposes to create a Keras model, TFF exposes multiple distinct ways of creating a tff.learning.Model. One of them is through the constructor functions, tff.learning.from_keras_model or tff.learning.from_compiled_keras_model, but each of these functions constructs and returns an instance of the abstract base class tff.learning.Model; the purpose of this section of the tutorial is to show that it is possible to instead directly construct such an instance by implementing the appropriate methods in the abstract interface.
If it is the collections.namedtuple MnistVariables you are asking about, it is simply a data container class introduced for convenience, to help group the tf.Variables that the TFF runtime will use to track state during training. One important thing to note from the tff.learning.Model documentation, and evidenced by this tutorial, is the line:
All tf.Variables should be introduced in __init__
If you are familiar with TensorFlow Variables, you will understand that controlling their instantiation is quite important.
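To make this concrete, a sketch along the lines of the tutorial's create_mnist_variables helper is shown below (shapes assume the flattened 784-pixel MNIST input and 10 classes; the exact initializers in the linked tutorial differ slightly). A tff.learning.Model implementation would call such a helper from its __init__, which is exactly why the rule above matters:

import collections
import tensorflow as tf

MnistVariables = collections.namedtuple(
    'MnistVariables', 'weights bias num_examples loss_sum accuracy_sum')

def create_mnist_variables():
    # Trainable model parameters plus non-trainable counters for cumulative metrics.
    return MnistVariables(
        weights=tf.Variable(tf.zeros((784, 10), tf.float32), name='weights', trainable=True),
        bias=tf.Variable(tf.zeros((10,), tf.float32), name='bias', trainable=True),
        num_examples=tf.Variable(0.0, name='num_examples', trainable=False),
        loss_sum=tf.Variable(0.0, name='loss_sum', trainable=False),
        accuracy_sum=tf.Variable(0.0, name='accuracy_sum', trainable=False))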

Complete Beta (or Gamma) function in TensorFlow

I've read some articles about using other distributions to model a stochastic policy in reinforcement learning. Usually a Gaussian distribution is used, but some have used a Beta distribution: https://en.wikipedia.org/wiki/Beta_distribution
There is already a Beta distribution class in TensorFlow, which allows people to use it with Tensors.
But some policy gradient methods put a constraint on the optimization process using the Kullback-Leibler divergence.
That formula involves the digamma function, which is already implemented in TensorFlow. But I can't find the beta function (nor the gamma function, since they're linked) in TensorFlow - only the log gamma and the incomplete gamma. And I cannot use the scipy.special.beta function because it cannot manipulate tensors (my alpha and beta parameters are produced by a neural network).
I'm not enough of a specialist in this field, so perhaps my question is foolish, but I'd really appreciate an explanation.
Thanks a lot
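No answer is recorded above, but one common workaround (a sketch, assuming tf.math.lgamma, which TensorFlow does provide) is to build the complete Beta function from log-gamma, since B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b):

import tensorflow as tf

def beta_fn(a, b):
    # Work in log space for numerical stability, then exponentiate.
    # Operates on tensors, so a and b can come straight from a network's outputs.
    return tf.exp(tf.math.lgamma(a) + tf.math.lgamma(b) - tf.math.lgamma(a + b))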

Skipping a layer in backpropagation in Keras

I am using Keras with the TensorFlow backend and I am curious whether it is possible to skip a layer during backpropagation but still have it execute in the forward pass. So here is what I mean:
Lambda(lambda x: a(x))
I want to apply a to x in the forward pass, but I do not want a to be included in the derivative computation when backprop takes place.
I was trying to find a solution but I could not find anything. Can somebody help me out here?
UPDATE 2
In addition to tf.py_func, there is now an official guide on how to add a custom op.
UPDATE
See this question for an example of writing a custom op with gradient purely in Python without needing to rebuild anything. Note that there are some limitations to the method (see the documentation of tf.py_func).
Not exactly a solution to the problem, but still kind of an answer and too long for comments.
That's not even a Keras issue, but a TensorFlow one. Each op defines its own gradient computation that is used during backpropagation. I you really wanted to something like that, you would need to implement the op into TensorFlow yourself (no easy feat) and define the gradient that you want - because you can't have "no gradient", if anything it would be 1 or 0 (otherwise you can't go on with backpropagation). There is a tf.NoGradient function in TensorFlow which causes an op to propagate zeros, but I don't think it is meant to / can be used out of TensorFlow own internals.
UPDATE
Okay, so a bit more context. TensorFlow graphs are built of ops, which are implemented by kernels; this is basically a 1-to-1 mapping, except that there may be, for example, a CPU and a GPU kernel for an op, hence the distinction. The set of ops supported by TensorFlow is usually static - it can change with newer versions, but in principle you cannot add your own ops, because the ops of a graph go into the Protobuf serialized format, so if you made your own ops you would not be able to share your graph. Ops are defined at the C++ level with the macro REGISTER_OP (see for example here), and kernels with REGISTER_KERNEL_BUILDER (see for example here).
Now, where do gradients come into play? Well, the funny thing is that the gradient of an op is not defined at the C++ level; there are ops (and kernels) that implement the gradient of other ops (if you look at the previous files you'll find ops/kernels with names ending in Grad), but (as far as I'm aware) these are not explicitly "linked" at this level. It seems that the associations between ops and their gradients are defined in Python, usually via tf.RegisterGradient or the aforementioned tf.NoGradient (see for example here; Python modules starting with gen_ are autogenerated with the help of the C++ macros); these registrations inform the backpropagation algorithm how to compute the gradient of the graph.
So, how to actually work this out? Well, you need to create at least one op in C++ with the corresponding kernel/s implementing the computation that you want for your forward pass. Then, if the gradient computation that you want to use can be expressed with existing TensorFlow ops (which is most likely), you just need to call tf.RegisterGradient in Python and do the computation there in "standard" TensorFlow. This is quite complicated, but the good news is it's possible, and there's even an example for it (although I think they kind of forgot the gradient registration part in that one)! As you will see, the process involves compiling the new op code into a library (btw I'm not sure whether any of this works on Windows) that is then loaded from Python (obviously this involves going through the painful process of manually compiling TensorFlow with Bazel). A possibly more realistic example can be found in TensorFlow Fold, an extension of TensorFlow for structured data that registers (as of now) one custom operation here through a macro defined here that calls REGISTER_OP, and then in Python it loads the library and registers its gradient here through their own registration function defined here, which simply calls tf.NotDifferentiable (another name for tf.NoGradient).
tldr: It is rather hard, but it can be done and there are even a couple of examples out there.
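To make the "gradient purely in Python" route from the UPDATE concrete, a commonly used sketch looks like the following (the helper name and the random suffix for the registration name are illustrative, not part of TensorFlow's API); it wraps tf.py_func and overrides its gradient via tf.RegisterGradient and gradient_override_map:

import numpy as np
import tensorflow as tf  # TF 1.x graph-mode API, as in the answer above

def py_func_with_grad(func, inp, Tout, grad, name=None):
    # Register the Python gradient function under a unique name, then tell the
    # graph to use it in place of the default PyFunc gradient.
    grad_name = 'PyFuncGrad' + str(np.random.randint(0, 1 << 31))
    tf.RegisterGradient(grad_name)(grad)
    g = tf.get_default_graph()
    with g.gradient_override_map({'PyFunc': grad_name}):
        return tf.py_func(func, inp, Tout, stateful=True, name=name)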
As mentioned in @jdehesa's comments, you can implement your function with an "alternative gradient". Forgive me if my math is not correct, but I think a derivative returning "1" would be the correct way to have no effect on the backpropagation while still passing the learning through. For how to construct it, see here. The example I cited goes further and allows you to construct an activation function from a Python function. So in place of the spiky function, substitute your function a, and in place of its derivative d_spiky use
def constant(x):
    return 1
So on the forward pass, a is applied in the layer, and on the backward pass 1 is applied, which should simply pass the weight adjustments through.
You can then just create an Activation layer in Keras using this function.
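In newer TensorFlow versions the same identity-gradient idea can be written without any op registration using tf.custom_gradient. This is only a sketch; tf.sign stands in for the question's function a (its ordinary gradient is useless, which makes the effect easy to see):

import tensorflow as tf
from tensorflow.keras.layers import Lambda

a = tf.sign  # stand-in for the question's function `a`

@tf.custom_gradient
def apply_a_skip_backprop(x):
    y = a(x)          # forward pass: apply a as usual
    def grad(dy):
        return dy     # backward pass: pass the incoming gradient through unchanged
    return y, grad

# Use it exactly like the Lambda layer from the question.
layer = Lambda(apply_a_skip_backprop)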

Why does TensorFlow have a lot of mathematical equations re-implemented?

I was looking through the TensorFlow API and noticed that a lot of mathematical operations that already exist in Python and NumPy have been re-implemented (or at least given a TensorFlow interface), for example tf.add and tf.matmul.
Is there a good reason to do this?
I've been searching through their documentation but can't find why they'd do this.
I do have some guesses though. One of my main guesses is that they probably want those operations to have some backpropagation effect on whatever neural network graph gets implemented. In other words, to have their derivatives implemented. Is this one of the reasons? (I wish I knew how to even check whether my guess is right.)
For example, in one of the most basic examples of linear regression, one defines the prediction function that one wants to implement:
product = tf.matmul(x,W)
y = product + b
instead of
product = tf.matmul(x,W)
y = tf.add(product, b)
Somehow the first implementation does not interfere with the Stochastic Gradient Descent algorithm for training, so it probably doesn't matter whether one uses numpy or tf.add to train? This is one aspect that confuses me: when do I know which one I should be using?
Or maybe there are performance reasons? Or maybe it's to give those operations access to the GPU when GPUs are required?
You have to understand that you create a TensorFlow graph with these operations, meaning they aren't the same as the NumPy functions; they are more an abstraction of them.
Maybe you have noticed that you have to create a session and then evaluate the functions through that session to get a result, whereas NumPy functions are executed directly. This is because the graph and its functions define what to do, like writing down a formula, but to get results for a specific x (or whatever) you have to insert a value for x. This is what you're doing through the session and eval.
So to conclude: with TensorFlow you define a graph, which is a more abstract representation of the functions. The graph isn't executed when it is defined; it is executed when you call the eval function and, through that, run the session.
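A small sketch of the difference described here, using the TF 1.x graph/session style the answer assumes:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(2,))
y = tf.add(x, 1.0)  # only adds a node to the graph; nothing is computed yet
                    # (y = x + 1.0 builds an equivalent node, since + is overloaded on tensors)

with tf.Session() as sess:
    # Only now is the graph executed, for the concrete value fed in for x.
    print(sess.run(y, feed_dict={x: np.array([1.0, 2.0], dtype=np.float32)}))

z = np.add(np.array([1.0, 2.0]), 1.0)  # the NumPy version executes immediately
print(z)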
Also notice that you can't mix NumPy functions and TensorFlow functions directly, but you can define your own TensorFlow ops (https://www.tensorflow.org/versions/r0.9/how_tos/adding_an_op/index.html).
Btw, I guess most of the TensorFlow functions are using NumPy under the hood. :)