Profiling TensorFlow 1.14

I'm profiling my tensorflow training and found out that some operations were performed on the CPU instead of the GPU.
When I force the device using with graph.device("/gpu:0"), a slicing operation raises an error saying:
Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
I "solved" it by forcing the device with with graph.device("/device:XLA_GPU:0") for the specific slicing operation. However, when profiling the process, I get (too) many transfers (MemcpyHtoD and MemcpyDtoH). In addition, even all the nodes of the graph are assigned to the GPU (or the XLA_GPU for the slicing operation), it appears that some variables sit on the CPU(see the Adam/VariableV2 on the image below, happening just after a MemcpyHtoD and followed by a MemcpyDtoH)!
I have trouble investigating and understanding how tensorflow handles the data, the operations and the device they are executed on.
Why isn't tensorflow able to run some very basic operations (like slicing) on the GPU?
Why are the transfers required between the host and the device for the Adam/VariableV2?
Are data transfers required between the GPU and the XLA_GPU?
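For reference, a minimal TensorFlow 1.14 sketch (the shapes and ops here are placeholders, not the actual model) that makes the placer report where every op lands and lets it fall back to the CPU instead of raising the kernel error:

import tensorflow as tf

graph = tf.Graph()
with graph.as_default(), graph.device("/gpu:0"):
    x = tf.random_normal([1000, 1000])   # stand-in for the real tensors
    y = x[:100, :100]                    # a slicing op like the one above

config = tf.ConfigProto(
    log_device_placement=True,   # print the device chosen for every op
    allow_soft_placement=True)   # fall back to CPU instead of raising
with tf.Session(graph=graph, config=config) as sess:
    sess.run(y)

The resulting log shows which device each op was actually assigned to.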

Related

What does SIMD mean?

I have read in the book "Operating System Internals and Design Principles" by William Stallings that GPUs are Single-Instruction, Multiple-Data (SIMD) devices, but I don't get what that means. I searched on Google and came up with the following assumption, which I am not sure is right or wrong:
SIMD GPU means the GPU processes only one instruction on an array of data; for example, in a game, the GPU is only responsible for the graphical representation of the game and the rest of the calculation is done by the CPU. Is that true?
In the context of GPUs, SIMD is a type of hardware architecture in which there are simultaneous (parallel) computations (executions of an instruction), but only a single process (instruction) at a given moment.
Schematically, the SIMD architecture can be drawn as the following:
(image credit: Wikipedia, https://en.wikipedia.org/wiki/SIMD)
"Data Pool" in our context is the GPU memory, and a PU is a processing unit or execution unit (a CUDA core, in NVIDIA's GPU terms).
Bottom line: a single GPU core can simultaneously execute the same instruction over different data.
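A rough Python illustration of the idea (an analogy at the framework level, not a description of the hardware): the single elementwise op below is one "instruction" applied across a whole pool of data, and on a GPU the individual elements are handled by the execution units in parallel.

import tensorflow as tf

# One "instruction" (an elementwise add) applied across many data elements at once.
a = tf.constant([1.0, 2.0, 3.0, 4.0])
b = tf.constant([10.0, 20.0, 30.0, 40.0])
c = a + b  # a single op over the whole data pool

with tf.Session() as sess:
    print(sess.run(c))  # [11. 22. 33. 44.]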

Where do Workers and Parameter Servers reside in Distributed TensorFlow?

In this post, it was mentioned that:
Also, there's no built-in distinction between worker and ps devices --
it's just a convention that variables get assigned to ps devices, and
ops are assigned to worker devices.
In this post, it was mentioned that:
TL;DR: TensorFlow doesn't know anything about "parameter servers", but
instead it supports running graphs across multiple devices in
different processes. Some of these processes have devices whose names
start with "/job:ps", and these hold the variables. The workers drive
the training process, and when they run the train_op they will cause
work to happen on the "/job:ps" devices, which will update the shared
variables.
Questions:
Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
Do the lower level libraries decide where to place a variable or operation?
Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
You can pin the ps job to either one of those (with exceptions, see below), but pinning it to a GPU is not practical. ps is really a storage of parameters and of the ops that update them. A CPU device can have a lot more memory (i.e., main RAM) than a GPU, and it is fast enough to update the parameters as the gradients come in. In most cases, matrix multiplications, convolutions and other expensive ops are done by the workers, hence placing a worker on a GPU makes sense. Placing a ps on a GPU is a waste of resources, unless the ps job is doing something very specific and expensive.
But: TensorFlow does not currently have a GPU kernel for integer variables, so the following code will fail when TensorFlow tries to place the variable i on GPU #0:
import tensorflow as tf

with tf.device("/gpu:0"):
    i = tf.Variable(3)  # integer variable: no GPU kernel available

with tf.Session() as sess:
    sess.run(i.initializer)  # Fails!
with the following message:
Could not satisfy explicit device specification '/device:GPU:0'
because no supported kernel for GPU devices is available.
This is the case when there's no choice of device for a parameter, and thus for a parameter server: only CPU.
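As a sketch of the convention described above (the host names are hypothetical), tf.train.replica_device_setter in TensorFlow 1.x pins variables to the ps job, which in practice means the CPU of the ps tasks, while the remaining ops go to the given worker device:

import tensorflow as tf

# Hypothetical cluster: one ps task and two worker tasks.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Variables go to /job:ps, everything else to the given worker device.
with tf.device(tf.train.replica_device_setter(
        cluster=cluster,
        worker_device="/job:worker/task:0/gpu:0")):
    w = tf.Variable(tf.zeros([784, 10]))      # placed on /job:ps/task:0
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(x, w)                  # placed on the worker GPU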
Do the lower level libraries decide where to place a variable or operation?
If I get this question right, node placement rules are pretty simple:
If a node was already placed on a device in a previous run of the graph, it is left on that device.
Else, if the user pinned a node to a device via tf.device, the placer places it on that device.
Else, it defaults to GPU #0, or the CPU if there is no GPU.
The TensorFlow whitepaper also describes a dynamic placer, which is more sophisticated, but it is not part of the open-source version of TensorFlow right now.
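A small sketch of rules 2 and 3 (explicit pinning versus the GPU-first default), with device-placement logging turned on so the chosen devices are printed:

import tensorflow as tf

# Rule 2: the user pins the op explicitly.
with tf.device("/cpu:0"):
    a = tf.constant([1.0, 2.0], name="pinned_to_cpu")

# Rule 3: no pin, so it defaults to GPU #0 if one is available, else the CPU.
b = tf.reduce_sum(a, name="default_placement")

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(b)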

TensorFlow Device Contexts, Streams and Context Switching

In the GPUDevice code, I noticed that one GPUDeviceContext is made per stream.
Is the purpose of this that each context can control one OpKernelContext, so that as the various streams need to be executed, the contexts can simply be switched, which handles pushing different data/code onto the GPU and then executing it?
Do the various streams get registered as different devices (i.e. '/gpu:0' and '/gpu:1')?
Per this, ThreadPoolDevices don't have contexts, but if I were to add contexts into ThreadPoolDevice, would they fit best as a sort of ThreadContext?
For GPU, we maintain a few streams for execution: a compute stream (on which most computational kernels run), and some memcopy streams (for executing memcopies between host and device and vice versa). This is done to overlap communication and computation on GPU devices, but is particular to the way that we use GPUs. One could easily also just create one GPU stream for all computation and communication and it would be correct, although slower.
We want to give the computation stream to kernels that do computation, and the memcopy stream to the kernels that do copying. We create a GPUDeviceContext object for each stream, and then pass the right device context object to the OpKernelContext.
So the particular implementations here reflect the properties of the asynchronous hardware device (GPU), which is why the ThreadPoolDevice doesn't have these sorts of mechanisms. On CPU all computation is synchronous, so there is no need for an abstraction such as streams.
The execution model of the custom hardware will likely determine what kind of state and management custom device support will require in TensorFlow.

How does Tensorflow support Cuda streams?

Does TensorFlow utilize CUDA streams automatically for concurrent execution of the computation graph on a single GPU, or should streams be assigned manually to ops/tensors?
For now, TensorFlow only uses one compute stream, and multiple copy streams. Some kernels may choose to use multiple streams for computation, while maintaining a single-stream semantics.
Our experiments showed that enabling multi-stream automatically does not bring much performance gain, since most of our kernels are large enough to utilize all the processors on the GPU. But enabling multi-stream would disable our current design of recycling GPU memory aggressively.
This is a decision we might revisit in the future. If that happens, TensorFlow would likely assign ops/kernels to different CUDA streams automatically, without exposing them to users.

Can GPU be used to run programs that run on CPU?

Can the GPU be used to run programs that normally run on the CPU, like getting input from the keyboard and mouse, playing music, or reading the contents of a text file, using the Direct3D and OpenGL APIs?
The GPU has no direct access to any memory that is mapped by the OS to be accessed within client code (i.e. code that is executed in user mode, with the instructions running on the CPU).
In addition, the GPU is not meant to perform tasks like these; it is designed to perform floating-point arithmetic at high speed. And finally, you would never use Direct3D or OpenGL for anything that is not related to graphics, unless you are only going to use the compute shader.
General-purpose computations on the GPU, such as image manipulation or physics simulations, are performed with OpenCL or CUDA.
You can, however, gather any data on the CPU, send it to the GPU for further processing, and finally write the results back into memory accessible from the CPU.
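Staying with this thread's TensorFlow setting, a minimal sketch of that round trip: the data is gathered on the CPU, the numeric work runs on the GPU, and the result is copied back into CPU-accessible memory.

import numpy as np
import tensorflow as tf

# Data gathered on the CPU (e.g. read from a file or built from user input).
host_data = np.random.rand(1024, 1024).astype(np.float32)

with tf.device("/gpu:0"):                      # numeric work done on the GPU
    x = tf.placeholder(tf.float32, [1024, 1024])
    y = tf.matmul(x, x)

with tf.Session() as sess:
    # feed_dict copies host -> device; fetching y copies device -> host.
    result = sess.run(y, feed_dict={x: host_data})
print(result.shape)  # the result is back in CPU-accessible memory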