How to speed up Tensorflow-gpu with using CUDA code simultaneoulsy

How to speed up Tensorflow-gpu with using CUDA code simultaneoulsy - tensorflow

I only have one GPU(GTX 1070, 8GB VRAM) and I would like to using tensorflow-gpu with another CUDA code simultaneously, on the same GPU.
But, using CUDA code and tensorflow-gpu at the same time slows tensorflow-gpu down about twice time.
Is there any solutions to speed up when tensorflow-gpu and CUDA code are used together?

A slightly longer version of #talonmies comment:
GPUs are awesome, but they still have finite resources. Any competently-built application that uses the GPU will do its best to saturate the device, leaving few resources for other applications. In fact, one of the goals and challenges of optimizing GPU code - whether it be a shader, CUDA or CL kernel - is making sure that all CUs are used as efficiently as possible.
Assuming that TF is already doing that: When running another GPU-heavy application, or you're sharing a resource that's already running full-tilt. So, things slow down.
Some options are:
Get a second, or faster, GPU.
Optimize your CUDA kernels to reduce requirements and simplify your TF stuff. While this is always important to keep in mind when developing for GPGPU, it's unlikely to help with your current problem.
Don't run these things at the same time. This may turn out to be slightly faster than this quasi time-slicing situation that you currently have.

Related

Does tensorflow-quantum support GPU, and if so how do I make it use mine?

I am getting started on using tensorflow-quantum for some QML circuit simulations. I have everything configured correctly for TensorFlow with GPU, and when I run print(tf.config.list_physical_devices('GPU')), it reports the presence of my GPU.
However, I've done some Googling, and I've come across a few things suggesting that tensorflow-quantum doesn't actually support GPU acceleration for simulations (e.g. MichaelBroughton's first reply here, and this issue which is still open). However, it's unclear to me how up-to-date this state of affairs is. I can't find anything about adding GPU support in the version notes.
Does tensorflow-quantum currently support GPU? If so, how do I (a) make it use my GPU for simulations and (b) verify that it is doing so?

Is it safer to use OpenCL rather than SYCL when the objective is to have the most hardware-compatible program?

My objective is to obtain the ability of parallelizing a code in order to be able to run it on GPU, and the Graal would be to have a software that can run in parallel on any GPU or even CPU (Intel, NVIDIA, AMD, and so...).
From what I understood, the best solution would be to use OpenCL. But shortly after that, I also read about SYCL, that is supposed to simplify the codes that run on GPU.
But is it just that ? Isn't better to use a lower level language in order to be sure that it will be possible to be used in the most hardware possible ?
I know that all the compatibilities are listed on The Khronos Group website, but I am reading everything and its opposite on the Internet (like if a NVIDIA card supports CUDA, then it supports OpenCL, or NVIDIA cards will never work with OpenCL, even though OpenCL is supposed to work with everything)...
This is a new topic to me and there are lots of informations on the Internet... It would be great if someone could give me a simple answer to this question.

Probably yes.
OpenCL is supported on all AMD/Nvidia/Intel GPUs and on all Intel CPUs since around 2009. For best compatibility with almost any device, use OpenCL 1.2. The nice thing is that the OpenCL Runtime is included in the graphics drivers, so you don't have to install anything extra to work with it or to get it working on another machine.
SYCL on the other hand is newer and not yet established that well. For example, it is not officially supported (yet) on Nvidia GPUs: https://forums.developer.nvidia.com/t/is-sycl-available-on-cuda/37717/7
But there are already SYCL implememtations that are compatible with Nvidia/AMD GPUs, essentially built on top of CUDA or OpenCL 1.2, see here: https://stackoverflow.com/a/63468372/9178992

What does it do if I choose "None" in Hardware Accelerator?

Pretty straightforward question. I was just wondering what was doing the calculus when this option was chosen. Does it run on Goolge's CPU or on my hardware ?
I have looked on Google, Stackoverflow and Colab's Help without success finding a precise answer
Thanks :)
PS : When running a full Dense Network "without" accelarator it is approx. as fast as with TPU and a lot faster than with GPU.

Your guess is correct: None means CPU only, but on a Colab-managed cloud VM rather than your local machine. (Unless you've connected to a local Jupyter instance.
Also keep in mind that you'll need to adjust your code in order to take advantage of hardware accelerators like GPUs and TPUs.
Speedup on a GPU is often a bit magical since many frameworks automatically detect and take advantage of GPUs. Built-in support for TPUs is rare, and obtaining a speedup from TPUs will require adjusting your code.

How can I speed up deep learning on a non-NVIDIA setup?

Since I only have an AMD A10-7850 APU, and do not have the funds to spend on a $800-$1200 NVIDIA graphics card, I am trying to make due with the resources I have in order to speed up deep learning via tensorflow/keras.
Initially, I used a pre-compiled version of Tensorflow. InceptionV3 would take about 1000-1200 seconds to compute 1 epoch. It has been painfully slow.
To speed up calculations, I first self-compiled Tensorflow with optimizers (using AVX, and SSE4 instructions). This lead to a roughly 40% decrease in computation times. The same computations performed above now only take about 600 seconds to compute. It's almost bearable - kind of like you can watch paint dry.
I am looking for ways to further decrease computation times. I only have an integrated AMD graphics card that is part of the APU. (How) (C/c)an I make use of this resource to speed up computation even more?
More generally, let's say there are other people with similar monetary restrictions and Intel setups. How can anyone WITHOUT discrete NVIDIA cards make use of their integrated graphics chips or otherwise non-NVIDIA setup to achieve faster than CPU-only performance? Is that possible? Why/Why not? What needs to be done to achieve this goal? Or will this be possible in the near future (2-6 months)? How?

After researching this topic for a few months, I can see 3.5 possible paths forward:
1.) Tensorflow + OpenCl as mentioned in the comments above:
There seems to be some movement going on this field. Over at Codeplay, Lukasz Iwanski just posted a comprehensive answer on how to get tensorflow to run with opencl here (I will only provide a link as stated above because the information might change there): https://www.codeplay.com/portal/03-30-17-setting-up-tensorflow-with-opencl-using-sycl
The potential to use integrated graphics is alluring. It's also worth exploring the use of this combination with APUs. But I am not sure how well this will work since OpenCl support is still early in development, and hardware support is very limited. Furthermore, OpenCl is not the same as a handcrafted library of optimized code. (UPDATE 2017-04-24: I have gotten the code to compile after running into some issues here!) Unfortunately, the hoped for speed improvements ON MY SETUP (iGPU) did not materialize.
CIFAR 10 Dataset:
Tensorflow (via pip ak unoptimized): 1700sec/epoch at 390% CPU
utilization.
Tensorflow (SSE4, AVX): 1100sec/epoch at 390% CPU
utilization.
Tensorflow (opencl + iGPU): 5800sec/epoch at 150% CPU
and 100% GPU utilization.
Your mileage may vary significantly. So I am wondering what are other people getting relatively speaking (unoptimized vs optimized vs opencl) on your setups?
What should be noted: opencl implementation means that all the heavy computation should be done on the GPU. (Updated on 2017/4/29) But in reality this is not the case yet because some functions have not been implemented yet. This leads to unnecessary copying back and forth of data between CPU and GPU ram. Again, imminent changes should improve the situation. And furthermore, for those interested in helping out and those wanting to speed things up, we can do something that will have a measurable impact on the performance of tensorflow with opencl.
But as it stands for now: 1 iGPU << 4 CPUS with SSE+AVX. Perhaps beefier GPUs with larger RAM and/or opencl 2.0 implementation could have made a larger difference.
At this point, I should add that similar efforts have been going on with at least Caffe and/or Theano + OpenCl. The limiting step in all cases appears to be the manual porting of CUDA/cuDNN functionality to the openCl paradigm.
2.) RocM + MIOpen
RocM stands for Radeon Open Compute and seems to be a hodgepodge of initiatives that is/will make deep-learning possible on non-NVIDIA (mostly Radeon devices). It includes 3 major components:
HIP: A tool that converts CUDA code to code that can be consumed by AMD GPUs.
ROCk: a 64-bit linux kernel driver for AMD CPU+GPU devices.
HCC: A C/C++ compiler that compiles code into code for a heterogeneous system architecture environment (HSA).
Apparently, RocM is designed to play to AMDs strenghts of having both CPU and GPU technology. Their approach to speeding up deep-learning make use of both components. As an APU owner, I am particularly interested in this possibility. But as a cautionary note: Kaveri APUs have limited support (only integrated graphcs is supported). Future APUs have not been released yet. And it appears, there is still a lot of work that is being done here to bring this project to a mature state. A lot of work will hopefully make this approach viable within a year given that AMD has announced their Radeon Instinct cards will be released this year (2017).
The problem here for me is that that RocM is providing tools for building deep learning libraries. They do not themselves represent deep learning libraries. As a data scientist who is not focused on tools development, I just want something that works. and am not necessarily interested in building what I want to then do the learning. There are not enough hours in the day to do both well at the company I am at.
NVIDIA has of course CUDA and cuDNN which are libaries of hand-crafted assembler code optimized for NVIDIA GPUs. All major deep learning frameworks build on top of these proprietary libraries. AMD currently does not have anything like that at all.
I am uncertain how successfully AMD will get to where NVIDIA is in this regard. But there is some light being shone on what AMDs intentions are in an article posted by Carlos Perez on 4/3/2017 here. A recent lecture at Stanford also talks in general terms about Ryzen, Vega and deep learning fit together. In essence, the article states that MIOpen will represent this hand-crafted library of optimized deep learning functions for AMD devices. This library is set to be released in H1 of 2017. I am uncertain how soon these libraries would be incorporated into the major deep learning frameworks and what the scope of functional implementation will be then at this time.
But apparently, AMD has already worked with the developers of Caffe to "hippify" the code basis. Basically, CUDA code is converted automatically to C code via HIP. The automation takes care of the vast majority of the code basis, leaving only less than 0.5% of code had to be changed and required manual attention. Compare that to the manual translation into openCl code, and one starts getting the feeling that this approach might be more sustainable. What I am not clear about is where the lower-level assembler language optimization come in.
(Update 2017-05-19) But with the imminent release of AMD Vega cards (the professional Frontier Edition card not for consumers will be first), there are hints that major deep learning frameworks will be supported through the MIOpen framework. A Forbes article released today shows the progress MiOpen has taken over just the last couple of months in terms of performance: it appears significant.
(Update 2017-08-25) MiOpen has officially been released. We are no longer talking in hypotheticals here. Now we just need to try out how well this framework works.
3.) Neon
Neon is Nervana's (now acquired by Intel) open-source deep-learning framework. The reason I mention this framework is that it seems to be fairly straightforward to use. The syntax is about as easy and intuitive as Keras. More importantly though, this framework has achieved speeds up to 2x faster than Tensorflow on some benchmarks due to some hand-crafted assembler language optimization for those computations. Potentially, cutting computation times from 500 secs/epoch down to 300 secs/epoch is nothing to sneeze at. 300 secs = 5 minutes. So one could get 15 epochs in in an hour. and about 50 epochs in about 3.5 hours! But ideally, I want to do these kinds of calculations in under an hour. To get to those levels, I need to use a GPU, and at this point, only NVIDIA offers full support in this regard: Neon also uses CUDA and cuDNN when a GPU is available (and of course, it has to be an NVIDIA GPU). If you have access other Intel hardware this is of course a valid way to pursue. Afterall, Neon was developed out of a motivation to get things to work optimally also on non-NVIDIA setups (like Nervana's custom CPUs, and now Intel FPGAs or Xeon Phis).
3.5.) Intel Movidius
Update 2017-08-25: I came across this article. Intel has released a USB3.0-stick-based "deep learning" accelerator. Apparently, it works with Cafe and allows the user perform common Cafe-based fine-tuning of networks and inference. This is important stressing: If you want to train your own network from scratch, the wording is very ambiguous here. I will therefore assume, that apart from fine-tuning a network, training itself should still be done on something with more parallel compute. The real kicker though is this: When I checked for the pricing this stick costs a mere $79. That's nothing compared to the cost of your average NVIDIA 1070-80(ti) card. If you merely want to tackle some vision problems using common network topologies already available for some related tasks, you can use this stick to fine tune it to your own use, then compile the code and put it into this stick to do inference quickly. Many use cases can be covered with this stick, and for again $79 it could be worth it. This being Intel, they are proposing to go all out on Intel. Their model is to use the cloud (i.e. Nervana Cloud) for training. Then, use this chip for prototype inference or inference where energy consumption matters. Whether this is the right approach or not is left for the reader to answer.
At this time, it looks like deep learning without NVIDIA is still difficult to realize. Some limited speed gains are difficult but potentially possible through the use of opencl. Other initiatives sound promising but it will take time to sort out the real impact that these initiatives will have.

If your platform supports opencl you can look at using it with tensorflow. There is some experimental support for it on Linux at this github repository. Some preliminary instructions are at the documentation section of of this github repository.

Keras using GPU vs CPU

Are there any differences from the programming point of view(syntax,functions or any other ) in 'Keras with back-end as Tensorflow' while working on 'Keras GPU' and 'Keras CPU'? I meant if one program can run on a GPU enabled Keras, Will the same program run on Keras CPU(efficiency doesn't matter)?

GPU code running on CPU? Sure, it is basic Multithreading. And you needed to avoid the race conditions anyway.
For most intents a GPU is just a giant load of realy weak CPU's, wich allows highly effective Multithreading (basically a display-high times display-width Core).
The other way (running procedural CPU code in a massively paralell GPU enviroment) is where the work lies.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas