How can I get the MXNet backend code for various functions? - mxnet

I am trying to understand how the internal flow goes in MXNet when we call forward. Is there any way to get the source code of MXNet?

This really depends on what your symbolic graph looks like. I assume you are using MXNet with Python (Python documentation). There you can choose between the MXNet symbol library and the Gluon library.
Now, you were asking whether one can inspect the code, and yes, you can find it on GitHub. The python folder contains the Python interface and src contains all MXNet sources. What happens on forward is ultimately determined by the MXNet execution engine, which tracks input/output dependencies of operators and neural network layers and allocates memory on the different devices (CPU, GPUs). There is a general architecture documentation covering this.
I suppose you are interested in what each and every operation does, such as argmax (a reduction), tanh (a unary math operation) or convolution (a complex neural network operation). These you can find in the operator folder of MXNet. This would require a whole documentation of its own, and there is a dedicated forum for MXNet specifics here, but I will give a short orientation:
Each operation in a (symbolic) execution graph needs a defined forward and backward operation. It also needs to infer its output shape, so that it can be chained with other operations. If an operator needs weights, it must also declare the amount of memory it requires, so MXNet can allocate it (see the Python sketch after this list).
Each operation typically requires several implementations: a) CPU, b) GPU (CUDA), and c) a wrapper around cuDNN.
All unary math operations follow the same pattern, so they are all defined in a similar way in mshadow_op.h (e.g. relu).
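To make the forward/backward/shape contract above concrete, here is a minimal sketch using MXNet's Python-level custom-op API (mx.operator.CustomOp / CustomOpProp). The built-in operators live in C++ under src/operator, but the contract is the same; the op name "myrelu" and the NumPy-based implementation are illustrative choices, not MXNet internals.

import mxnet as mx
import numpy as np

class MyRelu(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        # Forward pass: y = max(x, 0)
        x = in_data[0].asnumpy()
        self.assign(out_data[0], req[0], mx.nd.array(np.maximum(x, 0)))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # Backward pass: dx = dy * 1[x > 0]
        x = in_data[0].asnumpy()
        dy = out_grad[0].asnumpy()
        self.assign(in_grad[0], req[0], mx.nd.array(dy * (x > 0)))

@mx.operator.register("myrelu")
class MyReluProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(MyReluProp, self).__init__(need_top_grad=True)

    def infer_shape(self, in_shape):
        # Output shape equals input shape; no auxiliary states.
        return in_shape, [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return MyRelu()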
This is all I can tell you based on your quite broad question.

Related

What is the difference between static computational graphs in TensorFlow and dynamic computational graphs in PyTorch?

When I was learning TensorFlow, one of its basic concepts was the computational graph, and the graphs were said to be static.
And I found that in PyTorch the graphs are said to be dynamic.
What's the difference between static computational graphs in TensorFlow and dynamic computational graphs in PyTorch?
Both frameworks operate on tensors and view any model as a directed acyclic graph (DAG), but they differ drastically in how you can define them.
TensorFlow follows the 'data as code and code is data' idiom. In TensorFlow you define the graph statically before a model can run. All communication with the outer world is performed via a tf.Session object and tf.placeholder tensors, which are substituted with external data at runtime.
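A minimal sketch of the static style (TF 1.x API): the graph is defined first, and data is only fed through afterwards.

import tensorflow as tf  # TF 1.x API

x = tf.placeholder(tf.float32, shape=[None])  # stand-in for external data
y = x * 2.0                                   # graph definition only, nothing runs yet

with tf.Session() as sess:                    # all I/O goes through the session
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))  # [2. 4. 6.]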
In PyTorch things are far more imperative and dynamic: you can define, change and execute nodes as you go, with no special session interfaces or placeholders. Overall, the framework is more tightly integrated with the Python language and feels more native most of the time. When you write in TensorFlow, it sometimes feels as if your model is behind a brick wall with several tiny holes to communicate through. Still, this sounds more or less like a matter of taste.
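The same idea in PyTorch, as a minimal sketch: the graph is built as the code runs, so ordinary Python control flow can depend on tensor values.

import torch

def f(x):
    if x.sum() > 0:      # data-dependent control flow, no placeholders needed
        return x * 2
    return x - 1

x = torch.tensor([1.0, 2.0], requires_grad=True)
f(x).sum().backward()    # differentiates whichever branch actually ran
print(x.grad)            # tensor([2., 2.])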
However, those approaches differ not only from a software engineering perspective: there are several dynamic neural network architectures that benefit from the dynamic approach. Recall RNNs: with static graphs, the input sequence length stays constant. This means that if you develop a sentiment analysis model for English sentences you must fix the sentence length to some maximum value and pad all shorter sequences with zeros. Not too convenient. And you will run into more problems in the domain of recursive RNNs and tree-RNNs. Currently TensorFlow has limited support for dynamic inputs via TensorFlow Fold; PyTorch has it by default.
Reference:
https://medium.com/towards-data-science/pytorch-vs-tensorflow-spotting-the-difference-25c75777377b
https://www.reddit.com/r/MachineLearning/comments/5w3q74/d_so_pytorch_vs_tensorflow_whats_the_verdict_on/
Both TensorFlow and PyTorch allow specifying new computations at any point in time. However, TensorFlow has a "compilation" step which incurs a performance penalty every time you modify the graph. So TensorFlow's optimal performance is achieved when you specify the computation once, and then flow new data through the same sequence of computations.
It's similar to interpreters vs. compilers -- the compilation step makes things faster, but also discourages people from modifying the program too often.
To make things concrete, when you modify the graph in TensorFlow (by appending new computations using the regular API, or removing some computation using tf.contrib.graph_editor), this line is triggered in session.py. It will serialize the graph, and then the underlying runtime will rerun some optimizations, which can take extra time, perhaps 200 usec. In contrast, running an op in a previously defined graph, or in numpy/PyTorch, can be as low as 1 usec.
In TensorFlow you first have to define the graph, then you execute it.
Once defined, the graph is immutable: you can't add or remove nodes at runtime.
In PyTorch, instead, you can change the structure of the graph at runtime: you can add and remove nodes on the fly, dynamically changing its structure.

TensorFlow Conv2D implementation?

I'm trying to find where the implementation of an actual Conv2D operation is so I can assess memory access patterns. Tracing things around, it looks like execution of a Conv2D operation enters Eigen with a contract() function call. The problem is, I can't seem to find the definition or declaration of that function in either the TensorFlow or the Eigen source.
What functions are largely responsible for the execution of a Conv2D operation in TensorFlow? I'd like to see how it is parallelized, what the general memory access pattern is, and how the raw computations are done.
This query is for CPU specifically, as I've already looked into GPU execution to a degree.
After some searching, I found that the actual implementation of CPU Conv2D is in deep_conv2d.cc.
I think the CPU Conv2D is implemented in this file using Eigen conv ops, from line 61 onwards.
contract() returns an abstract expression whose evaluation is implemented in TensorContraction.h. It is essentially a wrapper on top of Eigen's matrix-matrix or matrix-vector products.
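To illustrate what "a wrapper on top of matrix products" means for memory access, here is a toy NumPy sketch of the im2col-plus-GEMM lowering of a 2-D convolution. The actual Eigen kernels are far more elaborate (blocking, vectorization, multithreading), but the access pattern is the same idea: overlapping strided reads of the input, then one dense product.

import numpy as np

def conv2d_as_gemm(x, w):
    # 'Valid' 2-D cross-correlation lowered to a single matrix product.
    H, W = x.shape
    kh, kw = w.shape
    oh, ow = H - kh + 1, W - kw + 1
    # im2col: each output position becomes one row of unrolled patch values,
    # so the input is read in overlapping strided windows...
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    # ...and the convolution itself is a single matrix-vector product.
    return (cols @ w.ravel()).reshape(oh, ow)

x = np.arange(16, dtype=float).reshape(4, 4)
print(conv2d_as_gemm(x, np.ones((2, 2))))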

Flink-like barrier for Tensorflow

Is there an equivalent of a Flink barrier for Tensorflow?
There seems to be no way to interact with the executor from a given kernel except by throwing an exception, and any deviation from a "pure" dataflow execution is not allowed, such as:
Producing no output for a given input
Producing multiple outputs for a given input (e.g. splitting a sentence into words). I get around this by having such a kernel take a queue reference and do the enqueuing itself, but this feels like a modularity violation.
Receiving some sort of "control tuple / Tensor" so that multiple kernels can synchronize at some point (e.g. to implement a barrier). In other words, the only schedulable code for each kernel is Compute() on the normal Input and Output Tensors.
Is there any way to get Tensorflow to be able to behave more like a streaming framework? Is using Tensorflow as a streaming framework an unintended / improper use of it?
TensorFlow kernels can't behave like proper units in a streaming framework: as you pointed out, they are called once per set of inputs and expected to produce one set of outputs each time. But there are alternatives.
The tf.contrib.data framework is built on the concept of a Dataset, which is a unit that has all the properties you specified above (maybe not the control tuple yet, but that would be easy to add).
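For instance, a sketch using the tf.data API (the successor of tf.contrib.data, TF 1.x style) showing one-to-many and one-to-none transformations; the sentences and the empty-string filter are illustrative:

import tensorflow as tf  # TF 1.x tf.data API

sentences = tf.data.Dataset.from_tensor_slices(["a b c", "", "d e"])
words = (sentences
         # one input -> many outputs: split each sentence into words
         .flat_map(lambda s: tf.data.Dataset.from_tensor_slices(
             tf.string_split([s]).values))
         # one input -> no output: drop empty tokens
         .filter(lambda w: tf.not_equal(w, "")))

next_word = words.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    while True:
        try:
            print(sess.run(next_word))  # b'a' b'b' b'c' b'd' b'e'
        except tf.errors.OutOfRangeError:
            break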
Have you considered using the recently released Flink ML "integration" with TensorFlow?
https://github.com/FlinkML/flink-tensorflow

skipping layer in backpropagation in keras

I am using Keras with the TensorFlow backend and I am curious whether it is possible to skip a layer during backpropagation but have it execute in the forward pass. Here is what I mean:
Lambda(lambda x: a(x))
I want to apply a to x in the forward pass but I do not want a to be included in the derivation when the backprop takes place.
I was trying to find a solution but I could not find anything. Can somebody help me out here?
UPDATE 2
In addition to tf.py_func, there is now an official guide on how to add a custom op.
UPDATE
See this question for an example of writing a custom op with gradient purely in Python without needing to rebuild anything. Note that there are some limitations to the method (see the documentation of tf.py_func).
Not exactly a solution to the problem, but still kind of an answer and too long for comments.
That's not even a Keras issue, but a TensorFlow one. Each op defines its own gradient computation that is used during backpropagation. If you really wanted to do something like that, you would need to implement the op in TensorFlow yourself (no easy feat) and define the gradient that you want - because you can't have "no gradient"; if anything it would be 1 or 0 (otherwise you couldn't go on with backpropagation). There is a tf.NoGradient function in TensorFlow which causes an op to propagate zeros, but I don't think it is meant to / can be used outside of TensorFlow's own internals.
UPDATE
Okay, so a bit more context. TensorFlow graphs are built of ops, which are implemented by kernels; this is basically a 1-to-1 mapping, except that there may be, for example, a CPU and a GPU kernel for an op, hence the differentiation. The set of ops supported by TensorFlow is usually static (I mean it can change with newer versions, but in principle you cannot add your own ops), because the ops of a graph go into the Protobuf serialized format; if you made your own ops you would not be able to share your graph. Ops are defined at the C++ level with the macro REGISTER_OP (see for example here), and kernels with REGISTER_KERNEL_BUILDER (see for example here).
Now, where do gradients come into play? Well, the funny thing is that the gradient of an op is not defined at the C++ level; there are ops (and kernels) that implement the gradient of other ops (if you look at the previous files you'll find ops/kernels with names ending in Grad), but (as far as I'm aware) these are not explicitly "linked" at this level. It seems that the associations between ops and their gradients are defined in Python, usually via tf.RegisterGradient or the aforementioned tf.NoGradient (see for example here; Python modules starting with gen_ are autogenerated with the help of the C++ macros). These registrations inform the backpropagation algorithm about how to compute the gradient of the graph.
So, how do you actually work this out? Well, you need to create at least one op in C++ with the corresponding kernel/s implementing the computation that you want for your forward pass. Then, if the gradient computation that you want to use can be expressed with existing TensorFlow ops (which is most likely), you just need to call tf.RegisterGradient in Python and do the computation there in "standard" TensorFlow. This is quite complicated, but the good news is it's possible, and there's even an example for it (although I think they kind of forgot the gradient registration part in that one)! As you will see, the process involves compiling the new op code into a library (btw I'm not sure whether any of this works on Windows) that is then loaded from Python (obviously this involves going through the painful process of manually compiling TensorFlow with Bazel). A possibly more realistic example can be found in TensorFlow Fold, an extension of TensorFlow for structured data, which registers (as of now) one custom operation here through a macro defined here that calls REGISTER_OP, and then in Python loads the library and registers its gradient here through their own registration function defined here, which simply calls tf.NotDifferentiable (another name for tf.NoGradient).
tl;dr: It is rather hard, but it can be done, and there are even a couple of examples out there.
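As a small Python-only illustration of the registration machinery (TF 1.x): when the gradient you want already exists under another op's registration, you can reuse it with gradient_override_map instead of writing C++. Below, Round borrows Identity's gradient, so the forward pass rounds while the backward pass just passes gradients through; the choice of Round here is illustrative.

import tensorflow as tf  # TF 1.x API

x = tf.placeholder(tf.float32, [None])
g = tf.get_default_graph()
# Forward: tf.round(x); backward: the registered gradient of Identity.
with g.gradient_override_map({"Round": "Identity"}):
    y = tf.round(x)

dy_dx = tf.gradients(y, x)[0]
with tf.Session() as sess:
    print(sess.run(dy_dx, {x: [0.2, 0.7, 1.5]}))  # [1. 1. 1.]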
As mentioned in jdehesa's comments, you can implement your function with an "alternative gradient". Forgive me if my math is not correct, but I think a derivative returning 1 would be the correct way to have no effect on the backpropagation while still passing the learning through. For how to construct it, see here. The example I cited goes further and allows you to construct an activation function from a Python function. So in place of the spiky function, substitute your function a, and in place of its derivative d_spiky use
def constant(x):
    return 1
So in the forward pass, a is applied in the layer, and in the backward pass 1 is applied, which should simply pass the weight adjustments through.
You can then just create an Activation layer in Keras using this function.
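A related trick worth noting (not from the answers above, but commonly used for exactly this purpose): the stop_gradient straight-through pattern needs no custom ops at all. The function a below is a stand-in for whatever you want to skip in backprop.

from keras.layers import Lambda
import keras.backend as K

def a(x):
    return K.round(x)  # stand-in for the function to skip in backprop

# Forward pass computes a(x); backward pass sees only the identity,
# because the (a(x) - x) term is excluded from the gradient graph.
skip_layer = Lambda(lambda x: x + K.stop_gradient(a(x) - x))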

Why does TensorFlow have a lot of mathematical equations re-implemented?

I was looking through the API in TensorFlow and noticed that a lot of mathematical operations that already exist in Python and numpy have been re-implemented (or at least given a TensorFlow interface), for example tf.add and tf.matmul.
Is there a good reason to do this?
I've been searching through their pages but can't find why they'd do this.
I do have some guesses though. One of my main guesses is that they probably want those operations to have some backpropagation effect on whatever neural network graph gets implemented. In other words, to have their derivatives implemented. Is this one of the reasons? (I wish I knew how to even check whether my guess is right.)
For example, in one of the most basic examples of linear regression, one defines the prediction function that one wants to implement:
product = tf.matmul(x,W)
y = product + b
instead of
product = tf.matmul(x,W)
y = tf.add(product, b)
Somehow the first implementation does not interfere with the Stochastic Gradient Descent algorithm for training, so it probably doesn't matter whether one uses numpy or tf.add to train? This is one aspect that confuses me: when do I know which one I should be using?
Or maybe they are performance reasons? Or maybe its to give those operations access to GPU if required to use GPUs?
You have to understand that you create a TensorFlow graph with these operations, meaning they aren't the same as the numpy functions; they are more an abstraction of them.
Maybe you have noticed that you have to create a session and then evaluate the functions through that session to get a result, whereas numpy functions are executed directly. This is because the graph and its functions define what to do, like writing down a formula; but to get results for a specific x (or whatever) you have to insert a value for x. This is what you're doing through the session and eval.
So to conclude: with TensorFlow you define a graph, which is a more abstract representation of the functions, and the graph isn't executed when it is defined; it is executed when you call the eval function and, through that, run the session.
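A short TF 1.x sketch of that point. Note that + on a tensor is overloaded to build a graph op, which is why product + b and tf.add(product, b) in the question behave the same:

import numpy as np
import tensorflow as tf  # TF 1.x API, as in the question

x = tf.placeholder(tf.float32)
y = x * 2.0 + 1.0     # builds Mul and Add nodes; nothing is computed yet
print(y)              # a Tensor handle, not a value

with tf.Session() as sess:
    print(sess.run(y, {x: np.array([1.0, 2.0])}))  # [3. 5.]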
Also notice that you can't mix numpy functions and TensorFlow functions directly, but you can define your own TensorFlow functions (https://www.tensorflow.org/versions/r0.9/how_tos/adding_an_op/index.html).
Btw, under the hood most TensorFlow functions are not using numpy; they are implemented in C++ kernels (largely via Eigen, as discussed in the Conv2D question above). :)