Should I bother with setting NCHW for training and NHWC for testing? - tensorflow

https://www.tensorflow.org/performance/performance_guide#use_nchw_image_data_format
I've read that cuDNN has better performance with NCHW (feature maps on the second axis) but that NHWC is better on CPU (feature maps on last axis).
As of TensorFlow 1.2, I wonder if it's still recommended to manually support both formats, or if it's reasonable to expect tf.train, tf.layers etc. to automatically take care of dimension reordering as needed (I believe they should!). Manually supporting both data formats feels ugly and like a leaky abstraction with implementation details that I as a TensorFlow user should not have to know about, hence I'd like to avoid it.
Also, how much of a performance improvement would one reasonably expect to gain from GPU training with NCHW instead of NHWC?

It would be interesting to know where you found that CPU execution is faster in NHWC mode. Intel's MKL library for DNNs uses the NCHW format by default and, as I understand it, uses yet another opaque, SIMD-friendly format internally. So if you go with NCHW, at least you wouldn't have to maintain two versions.
I don't know what order of gain you could expect. Since cuDNN itself uses the NCHW order, I suppose TensorFlow does not convert formats back and forth at each layer, but only converts back into NHWC when needed (e.g. when you explicitly ask for the tensor values). So unless you do lots of exotic stuff outside of standard cuDNN operations, I would not be surprised if the gains are minor. But that is just an uneducated guess.
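If you do end up supporting both layouts, a common pattern in TF 1.x is to thread a single data_format argument through your layer calls instead of writing two model definitions. A minimal sketch, assuming the tf.layers API (the shapes and layer choices are just placeholders):

```python
import tensorflow as tf

def conv_block(inputs, data_format):
    """One conv + pool block that works for either layout.

    data_format is 'channels_first' (NCHW) or 'channels_last' (NHWC).
    """
    x = tf.layers.conv2d(inputs, filters=32, kernel_size=3,
                         padding='same', activation=tf.nn.relu,
                         data_format=data_format)
    return tf.layers.max_pooling2d(x, pool_size=2, strides=2,
                                   data_format=data_format)

# Pick the layout once (e.g. based on whether a GPU is available) and
# thread it through every layer call.
data_format = 'channels_first' if tf.test.is_gpu_available() else 'channels_last'
if data_format == 'channels_first':
    images = tf.placeholder(tf.float32, shape=(None, 3, 224, 224))
else:
    images = tf.placeholder(tf.float32, shape=(None, 224, 224, 3))
features = conv_block(images, data_format)
```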

Related

Differences in tensorflow prediction on CPU and GPU for CNN models

I have trained an FCN network on a GPU and have saved the model (.pb file). I am getting correct predictions on the GPU. However, I am getting NaN for the same model file when I run predictions on the CPU.
Are there any CPU/GPU flags that need to be set? Or are there any overflow issues on the CPU?
There are no special overflow conditions on the CPU. Both should implement IEEE 754.
There are different ways some high-level functions (tanh, sigmoid) can be implemented, and they are implemented differently on GPU vs CPU to take advantage of each platform.
Whenever you get NaN from your model something is most likely broken. Don't try to patch it with some flag, but instead try to debug and see what's going on. In almost all cases you have a degenerate model that only works because of some corner case of some hardware.
Once you've found the problem, it's usually fixed by capping some values or by modifying the way data is represented (taking log of large numbers for example).
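To track down where the NaN first appears, TF 1.x lets you instrument the graph with numeric checks: tf.add_check_numerics_ops() does this for every float tensor at once, or you can wrap individual suspects with tf.check_numerics. A minimal sketch with a hypothetical tiny graph standing in for the real FCN:

```python
import numpy as np
import tensorflow as tf

# Hypothetical tiny graph standing in for the real model.
x = tf.placeholder(tf.float32, shape=(None, 10))
w = tf.Variable(tf.random_normal((10, 5)))
logits = tf.matmul(x, w)

# Wrap a suspect tensor: this op fails with the given message as soon as it
# sees an Inf/NaN, instead of silently propagating it downstream.
safe_logits = tf.check_numerics(logits, message="logits contain Inf/NaN")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(safe_logits, feed_dict={x: np.random.randn(4, 10)})
```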

CUDA-like optimization on Tensorflow-GPU

I am trying to implement a neural network architecture (Self Organizing Maps) for execution on GPUs. I am exploring TensorFlow for this task.
In TensorFlow, I noticed that you just have to specify gpu as the device to execute something on the GPU, like in this post. It seems that the way operations are parallelized is decided by TF, and the user does not have the option to make optimization decisions. The "Optimizing for GPU" section of the TensorFlow Performance Guide also does not talk about explicit control over parallelizing operations.
My question is, can I do CUDA-like optimization in TensorFlow? More elaborately, is it possible to define which operation will be parallelized (like defining CUDA kernels for parallel operations)?
Yes, but you probably don't want to.
At the most extreme you can define your own op (as described here: https://www.tensorflow.org/extend/adding_an_op).
You can implement it as a GPU Kernel and write whatever you want.
You probably don't want to. The default operations are likely well optimized; I doubt you would be able to squeeze anything significant out of them.
You can decide the device placement for each individual operation (by using tf.device), but you will incur data-transfer overhead every time you switch. This should cover the cases where some operation is slow to execute on the GPU.
If you want to process part of the data on CPU and part on the GPU you can slice your data and do 2 operations (one on CPU and one on GPU).
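For the device-placement route, here is a minimal sketch of splitting a batch between CPU and GPU with tf.device (the device names, shapes, and the 50/50 split are illustrative; allow_soft_placement keeps it runnable on a machine without a GPU):

```python
import numpy as np
import tensorflow as tf

batch = tf.placeholder(tf.float32, shape=(128, 1024))
first_half, second_half = tf.split(batch, num_or_size_splits=2, axis=0)

with tf.device('/cpu:0'):
    cpu_out = tf.reduce_sum(tf.square(first_half), axis=1)

with tf.device('/gpu:0'):
    gpu_out = tf.reduce_sum(tf.square(second_half), axis=1)

# Concatenating the results forces a device-to-device copy -- exactly the
# transfer overhead mentioned above.
result = tf.concat([cpu_out, gpu_out], axis=0)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    out = sess.run(result, feed_dict={batch: np.random.randn(128, 1024)})
    print(out.shape)  # (128,)
```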
By default in TF, in graph mode (not in eager mode), all TF ops run in parallel. There is a thread pool for that, and its size is controlled via inter_op_parallelism_threads. (See also.)
That does not necessarily mean that e.g. multiple matmuls will really run in parallel if they are internally synchronized. That is the case for most CUDA ops, as there is only a single CUDA stream. See here.
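For reference, the thread-pool size mentioned above is set on the session config in TF 1.x; a minimal sketch (the thread counts are just placeholders to tune for your machine):

```python
import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=4,  # how many independent ops may run concurrently
    intra_op_parallelism_threads=8)  # threads used inside a single op (e.g. one matmul)

a = tf.random_normal((1000, 1000))
b = tf.random_normal((1000, 1000))
c = tf.matmul(a, b)

with tf.Session(config=config) as sess:
    sess.run(c)
```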

What is the difference of static Computational Graphs in tensorflow and dynamic Computational Graphs in Pytorch?

When I was learning TensorFlow, one basic concept was the computational graph, and the graph was said to be static.
And I found that in PyTorch the graph is said to be dynamic.
What is the difference between static computational graphs in TensorFlow and dynamic computational graphs in PyTorch?
Both frameworks operate on tensors and view any model as a directed acyclic graph (DAG), but they differ drastically on how you can define them.
TensorFlow follows the 'data as code and code is data' idiom. In TensorFlow you define the graph statically before a model can run. All communication with the outside world is performed via the tf.Session object and tf.placeholder, tensors that will be substituted by external data at runtime.
In PyTorch things are far more imperative and dynamic: you can define, change and execute nodes as you go, with no special session interfaces or placeholders. Overall, the framework is more tightly integrated with the Python language and feels more native most of the time. When you write in TensorFlow you sometimes feel that your model sits behind a brick wall with several tiny holes to communicate through. Anyway, this is still more or less a matter of taste.
However, the two approaches differ not only from a software-engineering perspective: several dynamic neural network architectures benefit from the dynamic approach. Recall RNNs: with static graphs, the input sequence length has to stay constant. This means that if you develop a sentiment-analysis model for English sentences, you must fix the sentence length to some maximum value and pad all shorter sequences with zeros. Not too convenient. And you will run into more problems in the domain of recursive RNNs and tree-RNNs. Currently TensorFlow has limited support for dynamic inputs via TensorFlow Fold; PyTorch has it by default.
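To make the contrast concrete, here is a minimal sketch of the same computation written in both styles (TF 1.x graph mode vs. PyTorch; the shapes and values are arbitrary):

```python
# TensorFlow 1.x: build the graph first, then push data through a Session.
import tensorflow as tf
x = tf.placeholder(tf.float32, shape=(None, 3))   # a hole to be filled later
y = tf.reduce_sum(x * 2.0, axis=1)                # adds nodes, computes nothing yet
with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: [[1., 2., 3.]]}))

# PyTorch: plain imperative code; the graph is recorded on the fly as ops execute.
import torch
x = torch.tensor([[1., 2., 3.]], requires_grad=True)
y = (x * 2.0).sum(dim=1)
print(y)                          # already evaluated
y.backward(torch.ones_like(y))    # the graph recorded during execution is used here
```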
Reference:
https://medium.com/towards-data-science/pytorch-vs-tensorflow-spotting-the-difference-25c75777377b
https://www.reddit.com/r/MachineLearning/comments/5w3q74/d_so_pytorch_vs_tensorflow_whats_the_verdict_on/
Both TensorFlow and PyTorch allow specifying new computations at any point in time. However, TensorFlow has a "compilation" step which incurs a performance penalty every time you modify the graph. So TensorFlow achieves its best performance when you specify the computation once and then flow new data through the same sequence of computations.
It's similar to interpreters vs. compilers -- the compilation step makes things faster, but also discourages people from modifying the program too often.
To make things concrete, when you modify the graph in TensorFlow (by appending new computations using the regular API, or removing some computation using tf.contrib.graph_editor), this line is triggered in session.py. It will serialize the graph, and then the underlying runtime will rerun some optimizations, which can take extra time, perhaps 200 usec. In contrast, running an op in a previously defined graph, or in numpy/PyTorch, can take as little as 1 usec.
In TensorFlow you first have to define the graph, then you execute it.
Once defined, your graph is immutable: you can't add or remove nodes at runtime.
In PyTorch, instead, you can change the structure of the graph at runtime: you can add and remove nodes on the fly, dynamically changing its structure.
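A small illustration of what changing the graph at runtime buys you: in the sketch below (the layer, sizes and threshold are arbitrary), ordinary Python control flow decides per input how deep the recorded graph gets, something a fixed static graph cannot express directly.

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 8)
x = torch.randn(1, 8, requires_grad=True)
out = x

# The loop condition depends on the data itself; autograd records exactly
# the ops that actually ran for this particular input.
steps = 0
while out.norm() < 10 and steps < 5:
    out = torch.relu(layer(out))
    steps += 1

out.sum().backward()   # gradients flow through whatever graph was built
```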

Does using Fp16 in deeplearning have adverse effect on the end result?

I see TensorFlow offers the use of fp16 in training and testing. Is it safe to use, or will it have an adverse effect on the final result?
According to Wikipedia, half-precision floating-point performance on NVIDIA cards is even worse than using double precision, and there are performance reports confirming that idea (https://github.com/tensorflow/tensorflow/issues/1300; see the bottom of the page, where roughly a 10% performance figure was reported).
Hence we can conclude that using fp16 is mainly about video RAM (i.e. the scale of the model); training otherwise proceeds normally.
It is OK to use lower precision; binary connections have also been investigated, and a good fit would be achieved anyway.
In contrast, AMD cards can offer astonishing fp16 performance while software support is rare, e.g. Theano with the libgpuarray backend on OpenCL. Even so, people mostly use fp32 for the computation.
It will affect the output during training, because of the extra precision that float32 provides, but after training you can 'quantize' the operations in your network to float16 for faster performance if your hardware supports float16 natively. If the hardware does not support such operations, you will likely see a slowdown instead.
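A minimal post-training sketch of that cast-to-fp16 idea (the weights dict and shapes are hypothetical; a real deployment would go through a proper conversion/quantization tool rather than raw NumPy casts):

```python
import numpy as np

# Hypothetical trained weights kept in float32.
weights = {"conv1": np.random.randn(3, 3, 3, 32).astype(np.float32)}

# Cast for inference: halves the memory footprint...
weights_fp16 = {k: v.astype(np.float16) for k, v in weights.items()}

# ...at the cost of precision: check how much the values actually moved.
err = np.max(np.abs(weights["conv1"] - weights_fp16["conv1"].astype(np.float32)))
print("max rounding error:", err)
```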

Compare deep learning framework between TensorFlow and PaddlePaddle

I want to do deep-learning research, but I don't know which framework I should choose between TensorFlow and PaddlePaddle. Can anyone compare the two frameworks? Which one is better, especially in terms of CPU running efficiency?
It really depends what you are shooting for...
If you plan on training, CPU is not going to work well for you. Use colab or kaggle.
Assuming you do get a GPU, it depends if you want to focus on classification or object detection.
If you focus on classification, Keras is probably the easiest to work with or pytorch if you want some advanced stuff and to be able to change things.
If you plan on object detection, things get complicated... Inference is reasonably easy, but training is not. There are actually 4 platforms you should consider:
Tensorflow - powerful but very difficult to work with. If you do not use Keras (and for OD you usually can't), you need to preprocess the dataset into tfrecords and it is a pain. The OD Api has very cryptic messages and it is very sensitive to the combination of tf version and api version. On the other hand, cool models like efficientdet are more or less easy to use.
MMdetection - very powerful framework, has lots of advanced models, and once you understand how to work with it, you can easily use any of the models it supports. The downside is that some models are slow to arrive (efficientdet, for example)
paddlepaddle - if you know Chinese, this should work ok, maybe. The documentation is a bit behind and usually requires lots of improvisation. Basically it is similar to mmdetection just with a few unique models and a few missing models.
detectron2 - I didn't work with this one, but it seems to support only a few models.
You probably first need to define for yourself what you want to do and then choose.
Good luck!
It is not that trivial. Some models run faster with one framework, others with another. Furthermore, it depends on the hardware as well. See this blog. If inference is your only concern, then you can develop your model in any of the popular frameworks like TensorFlow, PyTorch, etc. In the end, convert your model to ONNX format and benchmark its performance with DNN-Bench to choose the best inference engine for your application.
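For the ONNX step, a minimal export sketch from PyTorch (the model and shapes are placeholders; a TensorFlow model would instead go through a converter such as tf2onnx):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for whatever you actually trained.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Export traces the model with a dummy input of the right shape.
dummy_input = torch.randn(1, 128)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["logits"])
```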