Device-wide synchronization in SYCL on NVIDIA GPUs

Context
I'm porting a complex CUDA application to SYCL. The application uses multiple cudaStreams to launch its kernels, and in some cases it also uses the default stream, forcing a device-wide synchronization.
Problem
CUDA streams can be mapped quite easily to in-order SYCL queues; however, when encountering a device-wide synchronization point (i.e. cudaDeviceSynchronize()), I must explicitly wait on all the queues, since queue::wait() only waits on the commands submitted to that queue.
Question
Is there a way to wait on all the commands for a specific device, without having to explicitly call wait() on every queue?
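For illustration, the explicit workaround looks roughly like the sketch below; the registry helper is hypothetical, just to make the pattern concrete (SYCL queues are reference-counted handles, so copies refer to the same underlying queue):

    #include <sycl/sycl.hpp>
    #include <vector>

    // Hypothetical helper: track every queue created for a device so that a
    // "device-wide" synchronization is just a loop calling wait() on each.
    class QueueRegistry {
    public:
        sycl::queue create(const sycl::device& dev) {
            queues_.emplace_back(dev, sycl::property::queue::in_order{});
            return queues_.back();   // sycl::queue copies share the same queue
        }

        // Rough equivalent of cudaDeviceSynchronize() for the tracked queues.
        void wait_all(const sycl::device& dev) {
            for (auto& q : queues_)
                if (q.get_device() == dev)
                    q.wait();
        }

    private:
        std::vector<sycl::queue> queues_;
    };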

Related

In Vulkan can I receive notification in the form of a callback when a command buffer is finished?

I want to know when a command buffer has finished executing in Vulkan, but rather than regularly checking a fence myself, I was wondering if I could set a callback to be notified. Is this possible? The only callback I've seen mentioned is for something relating to allocations.
Vulkan is an explicit, low-level API. One of the design choices in such an API is that the graphics driver gets CPU time only when you give it CPU time (in theory). This allows your code more control over the CPU and its threads.
In order for Vulkan to call your code back at arbitrary times, it would have to have a CPU thread on which to do it. But as previously stated, that's not a thing Vulkan implementations are expected to have.
Even the allocation callbacks are only called in the scope of a Vulkan API function which performs those allocations. That is, the asynchronous processing of a Vulkan queue doesn't do allocations.
So you're going to have to either check manually or block a thread on the fence until it is released.
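If callback-style notification is what you really want, you can build it yourself by parking one of your own threads on the fence; a minimal sketch (error handling omitted, the helper name is mine):

    #include <vulkan/vulkan.h>
    #include <cstdint>
    #include <functional>
    #include <thread>

    // Block one of *your* threads on the fence and fire a callback when it
    // signals. Vulkan itself never calls back asynchronously; the thread is yours.
    void notify_when_done(VkDevice device, VkFence fence,
                          std::function<void()> on_done) {
        std::thread([device, fence, cb = std::move(on_done)]() {
            vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
            cb();
        }).detach();
    }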

Does Vulkan parallel rendering relies on multiple queues?

I'm a newbie to Vulkan and not very clear on how parallel rendering works; here are some questions (the "queue" mentioned below refers specifically to the graphics queue):
Does parallel rendering rely on a device which supports more than one queue?
If question 1 is a yes, what if the physical device has only one queue, but Vulkan abstracts it to 4 queues (which is the real case of my MacBook's GPU), will the rendering in this case really be parallel?
If question 1 is a yes, and there is only one queue in Vulkan's abstraction, does that mean the device definitely cannot render objects in parallel?
P.S. About question 2: when I use the Metal API, the number of queues is only one, but when using the Vulkan API, the number is 4, so I'm not sure it is right to say "the physical device only has one queue".
I have the sneaking suspicion you are abusing the word "parallel". Make sure you know what it means.
Rendering on GPU is by nature embarrassingly parallel. Typically one queue can feed the entire GPU, and typically apps assume that is true.
In all likelihood they made the number of queues equal to the CPU core count. In Vulkan, submissions to a single queue always need to be externally synchronized. Having more queues allows submitting from multiple threads without synchronization.
If there is only one Vulkan queue, you can only submit to one queue, and any submission has to be synchronized with a mutex or come only from one thread in the first place.
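In practice that external synchronization is just an application-owned mutex around the submit call; a minimal sketch:

    #include <vulkan/vulkan.h>
    #include <mutex>

    std::mutex queue_mutex;  // application-owned; Vulkan does no locking for you

    // Command buffers can be recorded in parallel on any thread, but the actual
    // submission to a single queue must be externally synchronized.
    void submit_from_any_thread(VkQueue queue, const VkSubmitInfo& info, VkFence fence) {
        std::lock_guard<std::mutex> lock(queue_mutex);
        vkQueueSubmit(queue, 1, &info, fence);
    }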

TensorFlow Execution on a single (multi-core) CPU Device

I have some questions regarding the execution model of TensorFlow in the specific case in which there is only a CPU device and the network is used only for inference, for instance using the Image Recognition (https://www.tensorflow.org/tutorials/image_recognition) C++ example on a multi-core platform.
In the following, I will try to summarize what I understood, while asking some questions.
Session->Run() (file direct_session.cc) calls ExecutorState::RunAsync, which initializes the TensorFlow ready queue with the root nodes.
Then, the instruction
runner_([=]() { Process(tagged_node, scheduled_usec); }); (executor.cc, function ScheduleReady, line 2088)
assigns the node (and hence the related operation) to a thread of the inter_op pool.
However, I do not fully understand how it works.
For instance, in the case in which ScheduleReady is trying to assign more operations than the size of the inter_op pool, how are operations enqueued? (FIFO order?)
Does each thread of the pool have a queue of operations, or is there a single shared queue?
Where can I find this in the code?
Where can I find the body of each thread of the pools?
Another question regards the nodes managed by inline_ready. How does the execution of these (inexpensive or dead) nodes differ from that of the other nodes?
Then, (still, to my understanding) the execution flow continues from ExecutorState::Process, which executes the operation, distinguishing between synchronous and asynchronous operations.
How do synchronous and asynchronous operations differ in terms of execution?
When the operation is executed, PropagateOutputs (which calls ActivateNodes) adds to the ready queue every successor node that has become ready thanks to the execution of the current node (its predecessor).
Finally, NodeDone() calls ScheduleReady(), which processes the nodes currently in the TensorFlow ready queue.
Conversely, how the intra_op thread pool is managed depends on the specific kernel, right? Is it possible that a kernel requests more operations than the intra_op thread pool size?
If yes, with which kind of ordering are they enqueued? (FIFO?)
Once operations are assigned to threads of the pool, is their scheduling left to the underlying operating system, or does TensorFlow enforce some kind of scheduling policy?
I'm asking here because I found almost nothing about this part of the execution model in the documentation; if I missed some documents, please point me to them.
Re ThreadPool: When TensorFlow uses DirectSession (as it does in your case), it uses Eigen's ThreadPool. I could not get a web link to the official version of Eigen used in TensorFlow, but here is a link to the thread pool code. This thread pool uses the RunQueue queue implementation. There is one queue per thread.
Re inline_ready: ExecutorState::Process is scheduled on some Eigen thread. When it runs it executes some nodes. As these nodes are done, they make other nodes (TensorFlow operations) ready. Some of these nodes are not expensive. They are added to inline_ready and executed in the same thread, without yielding. Other nodes are expensive and are not executed "immediately" in the same thread. Their execution is scheduled through the Eigen thread pool.
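Conceptually (this is a simplified illustration of the idea, not TensorFlow's actual code), that dispatch decision looks something like:

    #include <functional>
    #include <vector>

    struct Node { bool expensive; };

    // Cheap nodes stay on the current thread (inline_ready); expensive ones are
    // handed to the inter-op thread pool via the runner.
    void schedule_ready(const std::vector<Node*>& ready,
                        const std::function<void(std::function<void()>)>& runner,
                        const std::function<void(Node*)>& process) {
        std::vector<Node*> inline_ready;
        for (Node* n : ready) {
            if (n->expensive)
                runner([n, process] { process(n); });  // runs on some pool thread
            else
                inline_ready.push_back(n);             // runs below, no yield
        }
        for (Node* n : inline_ready) process(n);       // executed on this thread
    }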
Re sync/async kernels: TensorFlow operations can be backed by synchronous (most CPU kernels) or asynchronous kernels (most GPU kernels). Synchronous kernels are executed in the thread running Process. Asynchronous kernels are dispatched to their device (usually GPU) to be executed. When asynchronous kernels are done, they invoke the NodeDone method.
Re Intra Op ThreadPool: The intra op thread pool is made available to kernels to run their computation in parallel. Most CPU kernels don't use it (and GPU kernels just dispatch to GPU) and run synchronously in the thread that called the Compute method. Depending on configuration there is either one intra op thread pool shared by all devices (CPUs), or each device has its own. Kernels simply schedule their work on this thread pool. Here is an example of one such kernel. If there are more tasks than threads, they are scheduled and executed in unspecified order. Here is the ThreadPool interface exposed to kernels.
I don't know of any way TensorFlow influences the scheduling of OS threads. You can ask it to do some spinning (i.e. not immediately yield the thread to the OS) to minimize latency (from OS scheduling), but that is about it.
These internal details are not documented on purpose, as they are subject to change. If you are using TensorFlow through the Python API, all you should need to know is that your ops will execute when their inputs are ready. If you want to enforce some order beyond this, you should use:
with tf.control_dependencies(<tensors_that_you_want_computed_before_the_ops_inside_this_block>):
    tf.foo_bar(...)
If you are writing a custom CPU kernel and want to do parallelism inside it (rarely needed, usually only for very expensive kernels), the thread pool interface linked above is what you can rely on.
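A rough sketch of what that can look like in a custom CPU kernel; I am going from memory of the headers here, so treat the exact accessor and method names (tensorflow_cpu_worker_threads, ParallelFor) as assumptions to verify against your TensorFlow version:

    #include "tensorflow/core/framework/op_kernel.h"
    #include "tensorflow/core/lib/core/threadpool.h"

    // Sketch of intra-op parallelism inside a custom CPU kernel.
    class MyExpensiveOp : public tensorflow::OpKernel {
     public:
      explicit MyExpensiveOp(tensorflow::OpKernelConstruction* ctx) : OpKernel(ctx) {}

      void Compute(tensorflow::OpKernelContext* ctx) override {
        // The intra-op pool the answer refers to, as exposed to kernels.
        auto* workers = ctx->device()->tensorflow_cpu_worker_threads()->workers;
        const tensorflow::int64 total = 1 << 20;      // units of work
        const tensorflow::int64 cost_per_unit = 100;  // rough per-unit cost estimate
        workers->ParallelFor(total, cost_per_unit,
                             [](tensorflow::int64 begin, tensorflow::int64 end) {
                               // process elements [begin, end) on some pool thread
                             });
      }
    };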

How do I mitigate CUDA's very long initialization delay?

Initializing CUDA in a newly-created process can take quite some time, as long as half a second or more on many of today's server-grade machines. As @RobertCrovella explains, CUDA initialization usually includes establishment of a Unified Memory model, which involves harmonizing the device and host memory maps. This can take quite a long time for machines with a lot of memory, and there might be other factors contributing to this long delay.
This effect becomes quite annoying when you want to run a sequence of CUDA-utilizing processes which do not use complicated virtual memory mappings: they each have to sit through their own long wait, despite the fact that "essentially" they could just re-use whatever initializations CUDA made the last time (perhaps with a bit of cleanup code).
Now, obviously, if you somehow rewrote the code for all those processes to execute within a single process - that would save you those long initialization costs. But isn't there a simpler approach? What about:
Passing the same state information / CUDA context between processes?
Telling CUDA to ignore most host memory altogether?
Making the Unified Memory harmonization more lazy than it is now, so that it only happens to the extent that it's actually necessary?
Starting CUDA with Unified Memory disabled?
Keeping some daemon process on the side and latching on to its already-initialized CUDA state?
What you are asking about already exists. It is called MPS (Multi-Process Service), and it basically keeps a single GPU context alive at all times with a daemon process that emulates the driver API. The initial target application is MPI, but it does basically what you envisage.
Read more here:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
http://on-demand.gputechconf.com/gtc/2015/presentation/S5584-Priyanka-Sah.pdf
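For completeness, the per-process cost being discussed is easy to measure with a small sketch like the one below; running it with and without the MPS daemon active should show how much of the delay the shared context absorbs:

    #include <cuda_runtime.h>
    #include <chrono>
    #include <cstdio>

    // Time the first CUDA runtime call in this process; it triggers context
    // creation (including the Unified Memory setup discussed above).
    int main() {
        auto start = std::chrono::steady_clock::now();
        cudaFree(0);  // common idiom to force context initialization
        auto stop = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        std::printf("CUDA initialization took %lld ms\n", (long long)ms);
        return 0;
    }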

TensorFlow Device Contexts, Streams and Context Switching

In the GPUDevice code, I noticed that one GPUDeviceContext is made per stream.
Is the purpose of this so that every context can control one OpKernelContext, and then, as the various streams need to be executed, the contexts can just be switched, which handles pushing different data/code onto the GPU and then executing?
Do the various streams get registered as different devices (ie. '/gpu:0' and '/gpu:1')?
Per this, ThreadPoolDevices don't have contexts, but if I were to add contexts into ThreadPoolDevice, would they fit best as a sort of ThreadContext?
For GPU, we maintain a few streams for execution: a compute stream (on which most computational kernels run), and some memcopy streams (for executing memcopies between host and device and vice versa). This is done to overlap communication and computation on GPU devices, but is particular to the way that we use GPUs. One could easily also just create one GPU stream for all computation and communication and it would be correct, although slower.
We want to give the computation stream to kernels that do computation, and the memcopy stream to the kernels that do copying. We create a GPUDeviceContext object for each stream, and then pass the right device context object to the OpKernelContext.
So the particular implementations here reflect the properties of the asynchronous hardware device (GPU), which is why the ThreadPoolDevice doesn't have these sorts of mechanisms. On CPU all computation is synchronous, so there is no need for an abstraction such as streams.
The execution model of the custom hardware will likely determine what kind of state and management custom device support will require in TensorFlow.
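The stream split described above maps directly onto the usual CUDA idiom; a generic sketch (not TensorFlow's code) of a copy stream feeding a compute stream:

    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *host, *dev;
        cudaMallocHost(&host, n * sizeof(float));  // pinned memory, needed for truly async copies
        cudaMalloc(&dev, n * sizeof(float));

        cudaStream_t copy_stream, compute_stream;
        cudaStreamCreate(&copy_stream);
        cudaStreamCreate(&compute_stream);

        // Copy on one stream, compute on another; an event lets the compute
        // stream wait for the copy without blocking the host thread.
        cudaEvent_t copied;
        cudaEventCreate(&copied);
        cudaMemcpyAsync(dev, host, n * sizeof(float), cudaMemcpyHostToDevice, copy_stream);
        cudaEventRecord(copied, copy_stream);
        cudaStreamWaitEvent(compute_stream, copied, 0);
        scale<<<(n + 255) / 256, 256, 0, compute_stream>>>(dev, n);

        cudaStreamSynchronize(compute_stream);
        return 0;  // cleanup omitted for brevity
    }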