Flink-like barrier for Tensorflow

Is there an equivalent of a Flink barrier for Tensorflow?
There seems to be no way to interact with the executor from a given kernel except by throwing an exception, and any deviation from "pure" dataflow execution is not allowed, such as:
Producing no output for a given input
Producing multiple outputs for a given input (e.g. splitting a sentence into words). I get around this by having such a kernel take a queue reference and do the enqueuing itself, but this feels like a modularity violation.
Receiving some sort of "control tuple / Tensor" so that multiple kernels can synchronize at some point (e.g. to implement a barrier). In other words, the only schedulable code for each kernel is Compute() on its normal input and output Tensors.
Is there any way to get Tensorflow to be able to behave more like a streaming framework? Is using Tensorflow as a streaming framework an unintended / improper use of it?

While TensorFlow kernels can't behave like proper units in a streaming framework, since, as you pointed out, they are called once per set of inputs and expected to produce one set of outputs each time, there are alternatives.
The tf.contrib.data framework is built on the concept of a Dataset, which has all the properties you listed above (maybe not the control tuple yet, but that would be easy to add).
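For instance, here is a minimal sketch using the TF 1.x tf.contrib.data API (the sentences and the stop-word filter are made-up examples) that produces many outputs for one input and no output for some inputs:

```python
import tensorflow as tf

# Hypothetical input: each element is a whole sentence.
sentences = tf.constant(["the quick fox", "", "jumps"])
dataset = tf.contrib.data.Dataset.from_tensor_slices(sentences)

# Multiple outputs per input: split each sentence into words.
words = dataset.flat_map(
    lambda s: tf.contrib.data.Dataset.from_tensor_slices(
        tf.string_split([s]).values))

# Zero outputs for some inputs: drop the stop word "the".
filtered = words.filter(lambda w: tf.not_equal(w, "the"))

# Consume the stream element by element.
next_word = filtered.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    while True:
        try:
            print(sess.run(next_word))
        except tf.errors.OutOfRangeError:
            break
```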

Have you considered using the recently released Flink ML "integration" with TensorFlow?
https://github.com/FlinkML/flink-tensorflow

Related

Programmatic Hyperparameter Tuning for TensorFlow Object Detection API

I am using the TF Object Detection API with a custom data set. I am training via SLURM jobs and calling the API scripts from there. I am looking to tune the hyperparameters found in the pipeline.config files. Unfortunately, the documentation does not outline this kind of process; it seems the options are either to use the sample configs or to tune the hyperparameters by hand.
Tuning by hand is somewhat feasible. For example, sweeping two parameters (batch size and steps) over three values each results in nine different .config files, but adding another hyperparameter boosts that to twenty-seven files I need to keep track of. This does not seem like a good way to do it: it limits the values I can try, and it is clumsy.
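To make the combinatorics concrete, a toy sketch (the parameter names and values here are hypothetical, not taken from my actual configs):

```python
from itertools import product

# Hypothetical value grids for three hyperparameters.
batch_sizes = [8, 16, 32]
num_steps = [50000, 100000, 200000]
learning_rates = [1e-3, 1e-4, 1e-5]

# Two parameters, three values each: 3 ** 2 = 9 configs.
print(len(list(product(batch_sizes, num_steps))))  # 9

# Adding a third hyperparameter: 3 ** 3 = 27 configs.
print(len(list(product(batch_sizes, num_steps, learning_rates))))  # 27
```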
It seems like there are libraries out there that hook into Keras and other more high-level frameworks, but I have found nothing that looks like it can take the results of the Object Detection API and actually optimize it.
Is it possible to do this with a pre-built library I don't know about? I would like to avoid having to edit the API implementation or coding this myself to minimize errors.

How can I get mxnet back end code for various functions?

I am trying to understand the internal flow in MXNet when we call forward. Is there any way to see the source code of MXNet?
This really depends on what your symbolic graph looks like. I assume you use MXNet with Python (Python documentation). There you can choose to use the MXNet symbol library or the Gluon library.
Now, you were asking whether one can inspect the code, and yes, you can find it on GitHub. The folder python contains the Python interface and src contains all MXNet sources. What happens on forward is ultimately defined by the MXNet execution engine, which tracks the input/output dependencies of operators and neural network layers and allocates memory on the different devices (CPU, GPUs). There is a general architecture documentation for this.
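As a starting point for tracing that flow, here is a minimal sketch (shapes and values are arbitrary) that builds a tiny symbol, binds it, and calls forward; you can step from these calls into the sources:

```python
import mxnet as mx

# A tiny symbolic graph: y = tanh(x).
x = mx.sym.Variable('x')
y = mx.sym.tanh(x)

# simple_bind allocates memory for the graph and hands it
# to the execution engine.
executor = y.simple_bind(ctx=mx.cpu(), x=(2, 3))

# forward() pushes the operators to the engine, which resolves
# their dependencies and schedules them on the device.
out = executor.forward(x=mx.nd.ones((2, 3)))
print(out[0].asnumpy())
```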
I suppose you are interested in what each and every operation does, such as argmax (a reduction), tanh (a unary math operation) or convolution (a complex neural network operation). You can find these in the operator folder of MXNet. A full description would require documentation of its own, and there is a dedicated forum for MXNet specifics, but I will give a short orientation:
Each operation in a (symbolic) execution graph needs a defined forward and backward operation. It also needs to define its output shape so that it can be chained with other operations. If the operator needs weights, it must define the amount of memory it requires so that MXNet can allocate it (see the sketch after this list).
Each operation requires several implementations: (a) CPU, (b) GPU (CUDA), and (c) a wrapper around cuDNN.
All unary math operations follow the same pattern, so they are all defined in a similar way in mshadow_op.h (e.g. relu).
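The same contract shows up in MXNet's Python-level custom-operator API; as a rough illustration (this is not the C++ internals themselves, and the operator name my_relu is made up), a ReLU sketched with mx.operator.CustomOp:

```python
import mxnet as mx

class MyRelu(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        # Forward pass: max(x, 0).
        self.assign(out_data[0], req[0], mx.nd.maximum(in_data[0], 0))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # Backward pass: gradient flows only where the input was positive.
        self.assign(in_grad[0], req[0], out_grad[0] * (in_data[0] > 0))

@mx.operator.register("my_relu")
class MyReluProp(mx.operator.CustomOpProp):
    def list_arguments(self):
        return ['data']

    def infer_shape(self, in_shape):
        # Output shape equals input shape; no weights, so no aux state.
        return in_shape, [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return MyRelu()
```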
This is all I can tell you based on your quite broad question.

Why is TensorFlow while_loop node required?

Why does the basic static, compiled computation-graph structure of TF (as opposed to a dynamic graph) necessitate a dedicated while-loop node and not allow the use of "regular" Python control flow expressions?
Thanks.
TensorFlow builds the computational graph and makes it static (unchangeable) for efficiency. Once it's finalized, telling the TensorFlow graph to do something is like sending input to a separate program which you can no longer change other than by passing in different inputs. At that point the graph has no knowledge of Python control flow; it just runs when called. Because of this, it needs to know explicitly, ahead of time, where you want to add a while loop inside the graph. You can, however, still use Python control flow and call the TensorFlow graph as though it were a specific function.
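A minimal sketch of the two options (TF 1.x style; the loop bounds are arbitrary):

```python
import tensorflow as tf

# Option 1: the loop lives inside the graph as a dedicated node.
i = tf.constant(0)
result = tf.while_loop(cond=lambda i: tf.less(i, 10),
                       body=lambda i: tf.add(i, 1),
                       loop_vars=[i])
with tf.Session() as sess:
    print(sess.run(result))  # 10

# Option 2: keep the loop in Python and call the graph repeatedly.
x = tf.placeholder(tf.int32)
step = tf.add(x, 1)
val = 0
with tf.Session() as sess:
    for _ in range(10):
        val = sess.run(step, feed_dict={x: val})
print(val)  # 10
```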

What is the difference of static Computational Graphs in tensorflow and dynamic Computational Graphs in Pytorch?

When I was learning TensorFlow, one of its basic concepts was the computational graph, and the graph was said to be static.
Then I found that in PyTorch the graph is said to be dynamic.
What is the difference between static computational graphs in TensorFlow and dynamic computational graphs in PyTorch?
Both frameworks operate on tensors and view any model as a directed acyclic graph (DAG), but they differ drastically on how you can define them.
TensorFlow follows the 'data as code and code is data' idiom. In TensorFlow you define the graph statically before a model can run. All communication with the outer world is performed via the tf.Session object and tf.placeholder tensors, which are substituted by external data at runtime.
In PyTorch things are far more imperative and dynamic: you can define, change and execute nodes as you go, with no special session interfaces or placeholders. Overall, the framework is more tightly integrated with the Python language and feels more native most of the time. When you write in TensorFlow you sometimes feel that your model is behind a brick wall with several tiny holes to communicate through. Still, this is more or less a matter of taste.
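A minimal side-by-side sketch of the two styles (the numbers are arbitrary):

```python
import tensorflow as tf
import torch

# TensorFlow (1.x): define the graph first, then feed data
# through placeholders at run time.
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
c = a * b
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 2.0, b: 3.0}))  # 6.0

# PyTorch: the graph is built implicitly as the code executes.
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0)
z = x * y          # node created on the fly
z.backward()       # gradients available immediately
print(x.grad)      # tensor(3.)
```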
However, the two approaches differ in more than a software-engineering perspective: several neural network architectures benefit from the dynamic approach. Recall RNNs: with static graphs, the input sequence length has to stay constant. This means that if you develop a sentiment analysis model for English sentences, you must fix the sentence length to some maximum value and pad all shorter sequences with zeros. Not too convenient, huh? And you will get more problems in the domain of recursive RNNs and tree-RNNs. Currently TensorFlow has limited support for dynamic inputs via TensorFlow Fold; PyTorch has it by default.
Reference:
https://medium.com/towards-data-science/pytorch-vs-tensorflow-spotting-the-difference-25c75777377b
https://www.reddit.com/r/MachineLearning/comments/5w3q74/d_so_pytorch_vs_tensorflow_whats_the_verdict_on/
Both TensorFlow and PyTorch allow specifying new computations at any point in time. However, TensorFlow has a "compilation" step which incurs a performance penalty every time you modify the graph. So TensorFlow's optimal performance is achieved when you specify the computation once and then flow new data through the same sequence of computations.
It's similar to interpreters vs. compilers -- the compilation step makes things faster, but also discourages people from modifying the program too often.
To make things concrete: when you modify the graph in TensorFlow (by appending new computations using the regular API, or removing some computation using tf.contrib.graph_editor), this line is triggered in session.py. It will serialize the graph, and then the underlying runtime will rerun some optimizations, which can take extra time, perhaps 200 usec. In contrast, running an op in a previously defined graph, or in numpy/PyTorch, can be as low as 1 usec.
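To illustrate the difference (a toy sketch, not a benchmark):

```python
import tensorflow as tf

sess = tf.Session()
x = tf.placeholder(tf.float32)

# Fast path: the graph is fixed; each run only feeds new data.
y = x * 2.0
for i in range(100):
    sess.run(y, feed_dict={x: float(i)})

# Slow path: each iteration appends a new node, so the runtime
# re-serializes and re-optimizes the graph on the next run.
for i in range(100):
    z = x * float(i)
    sess.run(z, feed_dict={x: 1.0})
```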
In TensorFlow you first define the graph, then you execute it.
Once defined, the graph is immutable: you can't add or remove nodes at runtime.
In PyTorch, by contrast, you can change the structure of the graph at runtime, adding and removing nodes as you go.
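For instance, in PyTorch the shape of the computation can depend on runtime values, so a different graph is built on every call (toy sketch):

```python
import torch

def forward(x):
    # The number of nodes created depends on the runtime value of x:
    # each call can build a differently shaped graph.
    while x.norm() < 100:
        x = x * 2
    return x

print(forward(torch.ones(3)))
```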

Sharing Queue between two graphs in tensorflow

Is it possible to share a queue between two graphs in TensorFlow? I'd like to do a kind of bootstrapping to select "hard negative" examples during training.
To speed up the process, I want separate threads for hard negative example selection and for the training process. The hard negative selection is based on the evaluation of the current model, and it will load its graph from a checkpoint file. The training graph runs on another thread and writes the checkpoint file. The two graphs should share the same queue: the training graph will consume examples and the hard negative selection will produce them.
Currently there's no support for sharing state between different graphs in the open-source version of TensorFlow: each graph runs in a separate session, and each session uses an isolated set of devices.
However, it seems that you could achieve your goal using a queue in a single graph. Simply construct a queue (using e.g. tf.FIFOQueue) and use tf.import_graph_def() to import the graph from the checkpoint file into the current graph. Using the return_elements argument to tf.import_graph_def() you can specify the name of the tensor that will contain the negative examples, and then add a q.enqueue_many() operation to add them to your queue. You would then fork a thread to run the enqueue_many operation in a loop. In your training graph, use q.dequeue_many() to get a batch of negative examples and feed them as the input to your training process.
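A rough sketch of that structure, assuming a serialized evaluation GraphDef on disk and a tensor named hard_negatives:0 inside it (the file name, tensor name, shapes and batch size are all hypothetical):

```python
import threading
import tensorflow as tf

# Shared queue: the selection side produces, the training side consumes.
queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.float32], shapes=[[128]])

# Hypothetical: load the evaluation graph exported from the checkpoint.
eval_graph_def = tf.GraphDef()
with open('eval_graph.pb', 'rb') as f:   # hypothetical file name
    eval_graph_def.ParseFromString(f.read())

# Pull out the tensor that holds the hard negatives (hypothetical name).
negatives, = tf.import_graph_def(
    eval_graph_def, return_elements=['hard_negatives:0'])

enqueue_op = queue.enqueue_many(negatives)
batch = queue.dequeue_many(32)   # input to the training graph

def selection_loop(sess):
    # Runs on its own thread, continually producing hard negatives.
    while True:
        sess.run(enqueue_op)

with tf.Session() as sess:
    threading.Thread(target=selection_loop, args=(sess,), daemon=True).start()
    # ... training loop: sess.run(train_op) consumes `batch` ...
```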