Where do Workers and Parameter Servers reside in Distributed TensorFlow? - tensorflow

In this post, it was mentioned that:
Also, there's no built-in distinction between worker and ps devices --
it's just a convention that variables get assigned to ps devices, and
ops are assigned to worker devices.
In another post, it was mentioned that:
TL;DR: TensorFlow doesn't know anything about "parameter servers", but
instead it supports running graphs across multiple devices in
different processes. Some of these processes have devices whose names
start with "/job:ps", and these hold the variables. The workers drive
the training process, and when they run the train_op they will cause
work to happen on the "/job:ps" devices, which will update the shared
variables.
Questions:
Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
Do the lower level libraries decide where to place a variable or operation?

Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
You can pin the ps job to either of those (with exceptions, see below), but pinning it to a GPU is not practical. The ps job is really a store of parameters plus the ops that update them. A CPU device can have far more memory (i.e., main RAM) than a GPU and is fast enough to apply the updates as the gradients come in. In most cases the expensive ops (matrix multiplications, convolutions and so on) are done by the workers, so placing a worker on a GPU makes sense. Placing a ps on a GPU is a waste of resources, unless the ps job is doing something very specific and expensive.
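For reference, here is a minimal sketch of that convention using the TF 1.x API (the cluster addresses and task indices below are made up for illustration): tf.train.replica_device_setter assigns variables to the ps job and everything else to the given worker device.
import tensorflow as tf  # TF 1.x API assumed

# Hypothetical cluster spec: one ps task and two worker tasks.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# replica_device_setter places variables on /job:ps (round-robin across
# ps tasks) and all other ops on the given worker device.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 784])
    w = tf.Variable(tf.zeros([784, 10]))   # placed on /job:ps/task:0
    b = tf.Variable(tf.zeros([10]))        # placed on /job:ps/task:0
    logits = tf.matmul(x, w) + b           # placed on /job:worker/task:0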
But: Tensorflow does not currently have a GPU kernel for integer variables, so the following code will fail when Tensorflow tries to place the variable i on GPU #0:
with tf.device("/gpu:0"):
    i = tf.Variable(3)

with tf.Session() as sess:
    sess.run(i.initializer)  # Fails!
with the following message:
Could not satisfy explicit device specification '/device:GPU:0'
because no supported kernel for GPU devices is available.
This is a case where there is no choice of device for the parameter, and thus for the parameter server: CPU only.
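Two common ways around this, sketched with the same TF 1.x API as above: pin the integer variable to the CPU explicitly, or enable soft placement so TensorFlow falls back to a supported device on its own.
# Option 1: pin the unsupported variable to the CPU explicitly.
with tf.device("/cpu:0"):
    i = tf.Variable(3)

# Option 2: let TensorFlow fall back to a supported device automatically.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(i.initializer)  # no longer fails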
Do the lower level libraries decide where to place a variable or operation?
If I get this question right, node placement rules are pretty simple:
If a node was already placed on a device in a previous run of the graph, it is left on that device.
Else, if the user pinned a node to a device via tf.device, the placer places it on that device.
Else, it defaults to GPU #0, or the CPU if there is no GPU.
The TensorFlow whitepaper also describes a dynamic placer, which is more sophisticated, but it is not part of the open-source version of TensorFlow right now.
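To see where the placer actually puts each node, you can turn on device placement logging; a minimal sketch with the TF 1.x session config:
# Print the device chosen for every op when the graph is run.
config = tf.ConfigProto(log_device_placement=True)
with tf.device("/cpu:0"):
    v = tf.Variable(3)  # rule 2: explicitly pinned to the CPU
m = tf.matmul(tf.ones([2, 2]), tf.ones([2, 2]))  # rule 3: defaults to GPU #0 if one exists

with tf.Session(config=config) as sess:
    sess.run(v.initializer)
    sess.run(m)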

Related

Is it possible to use system memory instead of GPU memory for processing Dask tasks

We have been running DASK clusters on Kubernetes for some time. Up to now, we have been using CPUs for processing and, of course, system memory for storing our Dataframe of around 1.5 TB (per DASK cluster, split across 960 workers). Now we want to update our algorithm to take advantage of GPUs. But it seems like the available memory on GPUs is not going to be enough for our needs; it will be a limiting factor (with our current setup, we are using more than 1 GB of memory per virtual core).
I was wondering if it is possible to use GPUs (thinking about NVIDIA and AMD cards with PCIe connections and their own VRAM, not integrated GPUs that use system memory) for processing and system memory (not GPU memory/VRAM) for storing DASK Dataframes. I mean, is it technically possible? Have you ever tried something like this? Can I schedule a Kubernetes pod such that it uses GPU cores and system memory together?
Another thing is, even if it were possible to use system RAM as the GPU's VRAM, is there a limit to how much system RAM can be allocated that way?
Note 1. I know that using system RAM with the GPU (if it were possible) would create unnecessary traffic over the PCIe bus and would result in degraded performance, but I would still need to test this configuration with real data.
Note 2. GPUs are fast because they have many simple cores performing simple tasks at the same time/in parallel. If an individual GPU core is not superior to an individual CPU core, then maybe I am chasing the wrong dream? I am already running Dask workers on Kubernetes that have access to hundreds of CPU cores. In the end, having a huge number of workers, each holding a small part of my data, will not mean better performance (it increases shuffling). There is no point in increasing the number of cores indefinitely.
Note 3. We are mostly manipulating python objects and doing math calculations using calls to .so libraries implemented in C++.
Edit1: The DASK-CUDA library seems to support spilling from GPU memory to host memory, but spilling is not what I am after.
Edit2: It is very frustrating that most of the components needed to utilize GPUs on Kubernetes are still experimental/beta.
Dask-CUDA: This library is experimental...
NVIDIA device plugin: The NVIDIA device plugin is still considered beta and...
Kubernetes: Kubernetes includes experimental support for managing AMD and NVIDIA GPUs...
I don't think this is possible directly as of today, but it's useful to mention why and reply to some of the points you've raised:
Yes, dask-cuda is what comes to mind first when I think of your use-case. The docs do say it's experimental, but from what I gather, the team has plans to continue to support and improve it. :)
Next, dask-cuda's spilling mechanism was designed that way for a reason -- while doing GPU compute, your biggest bottleneck is data transfer (as you have also noted), so by design we want to keep as much data in GPU memory as possible.
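For reference, spilling is configured on the cluster itself via dask-cuda's LocalCUDACluster; a minimal sketch (the memory limits below are made-up values, not recommendations):
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Spill from GPU memory to host RAM once a worker crosses the device limit.
cluster = LocalCUDACluster(
    device_memory_limit="8GB",   # per-GPU threshold before spilling to host RAM
    memory_limit="64GB",         # per-worker host-RAM limit
)
client = Client(cluster)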
I'd encourage you to open a topic on Dask's Discourse forum, where we can reach out to some NVIDIA developers who can help confirm. :)
A sidenote: there are some ongoing discussions around improving how Dask manages GPU resources. That's in its early stages, but we may see cool new features in the coming months!

Profiling python-tensorflow-1.14

I'm profiling my tensorflow training and found out that some operations were performed on the CPU instead of the GPU.
When forcing the device with with graph.device("/gpu:0"), a slicing operation raised an error saying
Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
I "solved" it by forcing the device with with graph.device("/device:XLA_GPU:0") for the specific slicing operation. However, when profiling the process, I get (too) many transfers (MemcpyHtoD and MemcpyDtoH). In addition, even though all the nodes of the graph are assigned to the GPU (or to the XLA_GPU for the slicing operation), it appears that some variables sit on the CPU (see the Adam/VariableV2 on the image below, happening just after a MemcpyHtoD and followed by a MemcpyDtoH)!
I have trouble investigating and understanding how TensorFlow handles the data, the operations and the devices they are executed on.
Why isn't tensorflow able to run some very basic operations (like slicing) on the GPU?
Why are the transfers required between the host and the device for the Adam/VariableV2?
Are data transfers required between the GPU and the XLA_GPU?
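One way to start investigating, sketched with the TF 1.x session API used above (train_op here stands for whatever op you run during training), is to have TensorFlow log the device chosen for every node, which makes unexpected CPU placements and the resulting host/device copies easier to spot:
import tensorflow as tf

# Print the device assigned to each op when the graph runs.
config = tf.ConfigProto(log_device_placement=True,
                        allow_soft_placement=True)
with tf.Session(graph=graph, config=config) as sess:
    sess.run(train_op)  # the console output lists the device of every op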

What does SIMD mean?

I have read in the book "Operating System Internals and Design Principles" by William Stallings that GPUs are Single-Instruction on Multiple Data (SIMD), and I don't get what that means. I searched on Google and came up with the following assumption, which I am not sure is right or wrong:
A SIMD GPU means the GPU processes only one instruction on an array of data; for example, in a game, the GPU is only responsible for the graphical representation of the game and the rest of the calculation is done by the CPU. Is that true?
In the context of GPUs, SIMD is a type of hardware architecture in which there are simultaneous (parallel) computations (executions of an instruction), but only a single instruction at any given moment.
Schematically, the SIMD architecture can be drawn as the following:
(credit to Wikipedia: https://en.wikipedia.org/wiki/SIMD)
Data Pool in our context is the GPU memory & PU is a processing unit or execution unit (Cuda core in NVidia's GPU terms).
Bottom line: a GPU can simultaneously execute the same instruction over different pieces of data.
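As a rough conceptual illustration in Python (NumPy is used here only as an analogy for the idea; on a GPU each processing unit applies the instruction to its own element in hardware):
import numpy as np

# One "instruction" (an elementwise add) applied to many data elements at once.
a = np.arange(8)      # data pool: [0, 1, 2, ..., 7]
b = np.full(8, 10)    # data pool: [10, 10, ..., 10]
c = a + b             # the same operation runs over every pair of elements
print(c)              # [10 11 12 13 14 15 16 17]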

Does TensorFlow use all of the hardware on the GPU?

The NVidia GP100 has 30 TPC circuits and 240 "texture units". Do the TPCs and texture units get used by TensorFlow, or are these disposable bits of silicon for machine learning?
I am looking at GPU-Z and Windows 10's built-in GPU performance monitor on a running neural net training session and I see various hardware functions are underutilized. Tensorflow uses CUDA. CUDA has access, I presume, to all hardware components. If I know where the gap is (between Tensorflow and underlying CUDA) and whether it is material (how much silicon is wasted) I can, for example, remediate by making a clone of TensorFlow, modifying it, and then submitting a pull request.
For example, the answer below discusses texture objects, accessible from CUDA. NVidia notes that these can be used to speed up latency-sensitive, short-running kernels. If I google "TextureObject tensorflow" I don't get any hits. So I can sort of assume, barring evidence to the contrary, that TensorFlow is not taking advantage of TextureObjects.
NVidia markets GPGPUs for neural net training. So far it seems they have adopted a dual-use strategy for their circuits, so they are leaving in circuits not used for machine learning. This raises the question of whether a pure TensorFlow circuit would be more efficient. Google is now promoting TPUs for this reason. The jury is out on whether TPUs are actually cheaper for TensorFlow than NVidia GPUs. NVidia is challenging Google's price/performance claims.
None of those things are separate pieces of individual hardware that can be addressed separately in CUDA. Read this passage on page 10 of your document:
Each GPC inside GP100 has ten SMs. Each SM has 64 CUDA Cores and four texture units. With 60 SMs,
GP100 has a total of 3840 single precision CUDA Cores and 240 texture units. Each memory controller is
attached to 512 KB of L2 cache, and each HBM2 DRAM stack is controlled by a pair of memory
controllers. The full GPU includes a total of 4096 KB of L2 cache.
And if we read just above that:
GP100 was built to be the highest performing parallel computing processor in the world to address the
needs of the GPU accelerated computing markets serviced by our Tesla P100 accelerator platform. Like
previous Tesla-class GPUs, GP100 is composed of an array of Graphics Processing Clusters (GPCs), Texture
Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers. A full GP100
consists of six GPCs, 60 Pascal SMs, 30 TPCs (each including two SMs), and eight 512-bit memory
controllers (4096 bits total).
and take a look at the diagram, we see the following:
So not only are the GPCs and SMs not separate pieces of hardware, but even the TPCs are just another way to reorganize the hardware architecture and come up with a fancy marketing name. You can clearly see the TPC doesn't add anything new in the diagram; it just looks like a container for the SMs. It's [1 GPC]:[5 TPCs]:[10 SMs].
The memory controllers are something all hardware is going to have in order to interface with RAM; it just happens that more memory controllers enable higher bandwidth, as this diagram shows:
where "High Bandwidth Memory" refers to HBM2, a type of video memory like GDDR5; in other words, video RAM. This isn't something you would directly address in software with CUDA any more than you would with x86 desktop machines.
So in reality, we only have SMs here, not TPCs and GPCs. So to answer your question, since TensorFlow takes advantage of CUDA, presumably it's going to use all the available hardware it can.
EDIT: The poster edited their question into an entirely different question, and has new misconceptions there, so here is the answer to that:
Texture Processing Clusters (TPCs) and Texture units are not the same thing. TPCs appear to be merely an organization of Streaming Multiprocessors (SM) with a bit of marketing magic thrown in.
Texture units are not a concrete term, and features differ from GPU to GPU, but basically you can think of them as the combination of texture memory, or ready access to texture memory, which employs spatial coherence (versus the L1, L2, L3... caches, which employ temporal coherence), together with some fixed-function functionality. Fixed functionality may include an interpolation access filter (often at least linear interpolation), different coordinate modes, mipmapping control and anisotropic texture filtering. See the CUDA 9.0 guide on this topic to get an idea of texture unit functionality and what you can control with CUDA. On the diagram we can see the texture units at the bottom.
Clearly these are completely different from the TPCs shown in the first picture I posted, which at least according to the diagram have no extra functionality associated with them and are merely a container for two SMs.
Now, despite the fact that you can address texture functionality within CUDA, you often don't need to. The texture units' fixed-function functionality is not all that useful to neural nets; however, the spatially coherent texture memory is often used automatically by CUDA as an optimization even if you don't explicitly try to access it. In this way, TensorFlow still would not be "wasting" silicon.

How does Tensorflow support Cuda streams?

Does TensorFlow utilize CUDA streams automatically for concurrent execution of the computation graph on a single GPU, or should streams be assigned manually to ops/tensors?
For now, TensorFlow only uses one compute stream, and multiple copy streams. Some kernels may choose to use multiple streams for computation, while maintaining single-stream semantics.
Our experiments showed that enabling multi-streaming automatically does not bring much performance gain, since most of our kernels are large enough to utilize all the processors in the GPU. But enabling multi-streaming would defeat our current design of recycling GPU memory aggressively.
This is a decision we might revisit in the future. If that happens, it is likely for TensorFlow to automatically assign ops/kernels to different Cuda streams, without exposing them to users.