How does Tensorflow support Cuda streams? - tensorflow

Does Tensorflow utilize Cuda streams automatically for concurrent execution of the computation graph on a single GPU or should streams be assigned manually to ops/tensors ?

For now, TensorFlow only uses one compute stream, and multiple copy streams. Some kernels may choose to use multiple streams for computation, while maintaining a single-stream semantics.
Our experiment showed that enabling multi-stream automatically does not bring much performance gains, since most of our kernels are large enough to utilize all processors in GPU. But enabling multi-stream would disable our current design to recycle GPU memory aggressively.
This is a decision we might revisit in the future. If that happens, it is likely for TensorFlow to automatically assign ops/kernels to different Cuda streams, without exposing them to users.

Related

Is it possible to use system memory instead of GPU memory for processing Dask tasks

We have been running DASK clusters on Kubernetes for some time. Up to now, we have been using CPUs for processing and, of course, system memory for storing our Dataframe of around 1,5 TB (per DASK cluster, split onto 960 workers). Now we want to update our algorithm to take advantage of GPUs. But it seems like the available memory on GPUs is not going to be enough for our needs, it will be a limiting factor(with our current setup, we are using more than 1GB of memory per virtual core).
I was wondering if it is possible to use GPUs (thinking about NVDIA, AMD cards with PCIe connections and their own VRAMS, not integrated GPUs that use system memory) for processing and system memory (not GPU memory/VRAM) for storing DASK Dataframes. I mean, is it technically possible? Have you ever tried something like this? Can I schedule a kubernetes pod such that it uses GPU cores and system memory together?
Another thing is, even if it was possible to allocate the system RAM as VRAM of GPU, is there a limitation to the size of this allocatable system RAM?
Note 1. I know that using system RAM with GPU (if it was possible) will create an unnecessary traffic through PCIe bus, and will result in a degraded performance, but I would still need to test this configuration with real data.
Note 2. GPUs are fast because they have many simple cores to perform simple tasks at the same time/in parallel. If an individual GPU core is not superior to an individual CPU core then may be I am chasing the wrong dream? I am already running dask workers on kubernetes which already have access to hundreds of CPU cores. In the end, having a huge number of workers with a part of my data won't mean better performance (increased shuffling). No use infinitely increasing the number of cores.
Note 3. We are mostly manipulating python objects and doing math calculations using calls to .so libraries implemented in C++.
Edit1: DASK-CUDA library seems to support spilling from GPU memory to host memory but spilling is not what I am after.
Edit2: It is very frustrating that most of the components needed to utilize GPUs on Kubernetes are still experimental/beta.
Dask-CUDA: This library is experimental...
NVIDIA device plugin: The NVIDIA device plugin is still considered beta and...
Kubernetes: Kubernetes includes experimental support for managing AMD and NVIDIA GPUs...
I don't think this is possible directly as of today, but it's useful to mention why and reply to some of the points you've raised:
Yes, dask-cuda is what comes to mind first when I think of your use-case. The docs do say it's experimental, but from what I gather, the team has plans to continue to support and improve it. :)
Next, dask-cuda's spilling mechanism was designed that way for a reason -- while doing GPU compute, your biggest bottleneck is data-transfer (as you have also noted), so we want to keep as much data on GPU-memory as possible by design.
I'd encourage you to open a topic on Dask's Discourse forum, where we can reach out to some NVIDIA developers who can help confirm. :)
A sidenote, there are some ongoing discussion around improving how Dask manages GPU resources. That's in its early stages, but we may see cool new features in the coming months!

How the GPU process non-graphic data in parallel?

As the introduction of programmable shaders in graphic pipeline enabled GPGPU concept which makes use of GPU as a general processing engine suited for parallel data.
However, as far as I know, because GPU is still used for graphic processing a lot compared to GPGPU, it makes use of lots of fixed graphic pipeline stages that cannot be programmed.
If my understanding is correct, when one data is processed by the GPU regardless of the type of data (graphic or general), it should be processed through the fixed graphic pipeline which includes programmable stages and non-programmable fixed stages.
Does that mean non-graphical processing should go through graphical processing stages even though it doesn't make use of it? Or can it bypass those fixed stages used for graphics? If one can explain how the GPU pipeline works for GPGPU I would appreciate it.
TL;DR:
GPGPU completely bypasses the rendering pipeline, but the pipeline is still used today.
GPUs consist of two main parts (in relation to your question). The first one is the processing part, which consists of the memory, registers, warp units, dispatchers and streaming processors. The other part is a set of controllers, that are responsible for geometry processing and the graphics pipeline. Those controllers just issue commands for the Streaming Processors on how to process the data for each of the steps of the rendering pipeline, either hardwired or based on user supplied shaders. NVidia calls them "PolyMorph Engine", AMD "Geometric Processor".
Historically, some of those controllers were hardwired to do things a single way, so you could only programm the vertexshader, fragmentshader and pixelshader. The tesselation controller e.g. was hardwired on the GPU and not user programmable. As demands grew, more and more of those controllers became user-programmable and today most of them are completely programmable (Wikipedia).
In the beginning days of GPGPU, the only way to do computing was to hack the available shaders by using a texture with your input data on a full-screen face to calculate the result and then read the rendered image back (See slide 26 on this introduction).
With CUDA, NVidia allowed users not only to program the shaders/polymorph Engine, but also directly interact with the Streaming processors and execute code on those (See slide 31 & 32).
This does not mean, that the graphics pipeline became obsolete, but now there is a way to completely bypass it and directly run code on the GPU processors. Nvidia has a nice explanation on how the pipeline works today, where you can also see both the PolyMorph Engine and the Streaming Processors here.
The Graphics pipeline still helps the dev by offloading repetitive and more complicated parts of the process, like managing the memory, managing warps, passing data and all that stuff. Theoretically you could probably write your own pipeline directly on the StreamingProcessors using CUDA and then render the result, but it would be tedious. Just how writing a GPGPU-Code using Shaders would be tedious.
Although old GPUs have pipelines hardcoded in the chip, modern GPU itself is just a large ASIC that can compute vectorized data at stupid fast speed. It is human who defines what it can do. So the render pipeline is defined in the graphics library like OpenGL, not in GPU. Thus, GPU does not care what it is computing, as long as it is vectorized data, it can do all the computation needed and give you a result.

Profiling python-tensorflow-1.14

I'm profiling my tensorflow training and found out that some operations were performed on the CPU instead of the GPU.
When forcing the device with with graph.device("/gpu:0"), a slicing operation raised an error saying
Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
I "solved" it by forcing the device with with graph.device("/device:XLA_GPU:0") for the specific slicing operation. However, when profiling the process, I get (too) many transfers (MemcpyHtoD and MemcpyDtoH). In addition, even all the nodes of the graph are assigned to the GPU (or the XLA_GPU for the slicing operation), it appears that some variables sit on the CPU(see the Adam/VariableV2 on the image below, happening just after a MemcpyHtoD and followed by a MemcpyDtoH)!
I have trouble investigating and understanding how tensorflow handles the data, the operations and the device they are executed on.
Why isn't tensorflow able to run some very basic operations (like slicing) on the GPU?
Why are the transfers required between the host and the device for the Adam/VariableV2?
Are data transfer required between GPU and XLA_GPU?

TensorFlow Serving and serving more models than the memory can allow

TensorFlow Serving can serve multiple models by configuring the --model_config_file command line argument. I had success using this feature in small experiments.
However, it's unclear to me what happens when the total memory required by these models is larger than, say, the available GPU memory.
Does the server just crash? Or does it support keeping a subset of models available and possibly unloading/loading models based on the usage?
Thanks.
Trying to load a model when you are out of memory will fail to load that model. There's no dynamic loading/unloading at this time.
As currently written, it will crash if there isn't enough memory for all of the models requested to load. Internally there is a feature to gracefully decline to load a model that doesn't fit, which you could enable by writing a small PR that pipes the ServerCore::Options::total_model_memory_limit_bytes option [1] to a flag in main.cc. Note, however, that the notion of "fitting in memory" is based on a somewhat crude way of estimating model RAM footprint.
As Gautam said, it does not dynamically load/unload, although there is a library implemented for that (which isn't currently used in the released binary), called CachingManager [2].
[1] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/model_servers/server_core.h#L112
[2] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/core/caching_manager.h

Where do Workers and Parameter Servers reside in Distributed TensorFlow?

In this post, it was mentioned that:
Also, there's no built-in distinction between worker and ps devices --
it's just a convention that variables get assigned to ps devices, and
ops are assigned to worker devices.
In this post, it was mentioned that:
TL;DR: TensorFlow doesn't know anything about "parameter servers", but
instead it supports running graphs across multiple devices in
different processes. Some of these processes have devices whose names
start with "/job:ps", and these hold the variables. The workers drive
the training process, and when they run the train_op they will cause
work to happen on the "/job:ps" devices, which will update the shared
variables.
Questions:
Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
Do the lower level libraries decide where to place a variable or operation?
Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
You can pin ps job to either on of those (with exceptions, see below), but pinning it to GPU is not practical. ps is really a storage of parameters and ops to update it. A CPU device can have a lot more memory (i.e., main RAM) than a GPU and is fast enough to update the parameters as the gradients are coming in. In most cases, matrix multiplications, convolutions and other expensive ops are done by the workers, hence a placement of a worker on a GPU makes sense. A placement of a ps to a GPU is a waste of resources, unless a ps job is doing something very specific and expensive.
But: Tensorflow does not currently have a GPU kernel for integer variables, so the following code will fail when Tensorflow tries to place the variable i on GPU #0:
with tf.device("/gpu:0"):
i = tf.Variable(3)
with tf.Session() as sess:
sess.run(i.initializer) # Fails!
with the following message:
Could not satisfy explicit device specification '/device:GPU:0'
because no supported kernel for GPU devices is available.
This is the case when there's no choice of device for a parameter, and thus for a parameter server: only CPU.
Do the lower level libraries decide where to place a variable or operation?
If I get this question right, node placement rules are pretty simple:
If a node was already placed on a device in a previous run of the graph, it is left on that device.
Else, if the user pinned a node to a device via tf.device, the placer places it on that device.
Else, it defaults to GPU #0, or the CPU if there is no GPU.
Tensorflow whitepaper describes also a dynamic placer, which is more sophisticated, but it's not part of the open source version of tensorflow right now.