TensorFlow Device Contexts, Streams and Context Switching

In the GPUDevice code, I noticed that one GPUDeviceContext is made per stream.
Is the purpose of this so that each context can control one OpKernelContext, and then, as the various streams need to be executed, the contexts can simply be switched, which handles pushing different data/code onto the GPU and then executing it?
Do the various streams get registered as different devices (i.e. '/gpu:0' and '/gpu:1')?
Per this, ThreadPoolDevices don't have contexts, but if I were to add contexts to ThreadPoolDevice, would they fit best as a sort of ThreadContext?

For GPU, we maintain a few streams for execution: a compute stream (on which most computational kernels run), and some memcopy streams (for executing memcopies between host and device and vice versa). This is done to overlap communication and computation on GPU devices, but is particular to the way that we use GPUs. One could easily also just create one GPU stream for all computation and communication and it would be correct, although slower.
We want to give the computation stream to kernels that do computation, and the memcopy stream to the kernels that do copying. We create a GPUDeviceContext object for each stream, and then pass the right device context object to the OpKernelContext.
So the particular implementations here reflect the properties of the asynchronous hardware device (GPU), which is why the ThreadPoolDevice doesn't have these sorts of mechanisms. On CPU all computation is synchronous, so there is no need for an abstraction such as streams.
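As a concrete (and heavily simplified) illustration, here is a minimal C++ sketch of the idea; the types below are hypothetical stand-ins, not TensorFlow's actual StreamExecutor, GPUDeviceContext, or OpKernelContext classes. The device owns one compute stream and two memcopy streams, wraps each in its own device-context object, and hands the context that matches the kernel's kind of work to that kernel.

#include <string>

// Hypothetical stand-ins for the real stream and context types.
struct GpuStream { std::string purpose; };        // e.g. "compute", "host_to_device"

struct DeviceContext {
  explicit DeviceContext(GpuStream* s) : stream(s) {}
  GpuStream* stream;                              // the stream this context issues work on
};

struct OpKernelContextParams {
  DeviceContext* op_device_context = nullptr;     // the context handed to the kernel
};

class SketchGpuDevice {
 public:
  SketchGpuDevice()
      : compute_stream_{"compute"},
        h2d_stream_{"host_to_device"},
        d2h_stream_{"device_to_host"},
        compute_ctx_(&compute_stream_),
        h2d_ctx_(&h2d_stream_),
        d2h_ctx_(&d2h_stream_) {}

  // Give copy kernels a memcopy-stream context and everything else the
  // compute-stream context, so copies and computation can overlap.
  void FillParamsForKernel(bool copies_to_device, bool copies_to_host,
                           OpKernelContextParams* params) {
    if (copies_to_device) {
      params->op_device_context = &h2d_ctx_;
    } else if (copies_to_host) {
      params->op_device_context = &d2h_ctx_;
    } else {
      params->op_device_context = &compute_ctx_;
    }
  }

 private:
  GpuStream compute_stream_, h2d_stream_, d2h_stream_;
  DeviceContext compute_ctx_, h2d_ctx_, d2h_ctx_;
};

int main() {
  SketchGpuDevice device;
  OpKernelContextParams params;
  device.FillParamsForKernel(/*copies_to_device=*/true, /*copies_to_host=*/false, &params);
  // params.op_device_context now points at the host-to-device copy context.
}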
The execution model of the custom hardware will likely determine what kind of state and management custom device support will require in TensorFlow.

Related

Rendering Terrain Dynamically with Argument Buffers: understanding why the particle buffer is not overwritten by the GPU in flight

I am looking through an Apple demo project associated with the 2017 WWDC video entitled "Introducing Metal 2", in which the developers demonstrate the use of argument buffers. The project is linked here on the page titled "Rendering Terrain Dynamically with Argument Buffers" on the Apple developer website. Here, they synchronize resource writes by the CPU to prevent race conditions using a dispatch_semaphore_t, signaling it when the command buffer finishes executing on the GPU and waiting on it if the CPU is writing data several frames ahead of the GPU. This is consistent with what was shown in the earlier 2014 WWDC session "Working with Metal: Fundamentals".
I noticed that the APPLParticleRenderer seems to send data to be written by the GPU in a compute pass before the fragment shader from the previous render pass has finished reading from that same buffer. The resource storage mode of the buffer is MTLResourceStorageModePrivate. My question: does Metal automatically synchronize access to private id<MTLBuffer>s accessible only by the GPU? Do render, compute, and blit passes encoded with a new id<MTLCommandEncoder> gain access to the buffer only after other passes have finished writing to and reading from it (exclusive access)? I have seen that there are guaranteed barriers within tile shaders, where tile memory is accessed exclusively by the kernel before subsequent fragment shaders access that memory.
Lastly, in the 2016 WWDC "What's New in Metal, Part 2", the first presenter, Charles Brissart, at 16:44 mentions that fragment and vertex functions reading and writing from the same buffer must be placed into two render command encoders, but for compute kernels one compute command encoder suffices. This is consistent with what is seen within the particle renderer.
See my comment on the original question for a brief version of this answer.
It turns out that Metal tracks dependencies between commands scheduled on the GPU by default for MTLResource types. The hazardTrackingMode property of an MTLResource defaults to MTLHazardTrackingModeTracked (MTLHazardTrackingMode.tracked in Swift) according to the Metal documentation. This means Metal tracks dependencies across commands that modify the resource, as is the case with the particle kernel, and delays execution until prior commands accessing the resource are complete.
Therefore, since the _particleDataPool buffer has a storage mode of MTLResourceStorageModePrivate (storageModePrivate in Swift), it can only be accessed by the GPU; hence, no CPU/GPU synchronization with a semaphore is necessary for this buffer, and thus no multi-buffer system is necessary for the resource.
Only when a resource can be written to by the CPU while the GPU is still reading from it do we want multiple buffers so the CPU is not idle.
Note that the default hazard tracking mode for an MTLHeap is MTLHazardTrackingModeUntracked (MTLHazardTrackingMode.untracked in Swift), in which case you are responsible for synchronizing resource writes by the GPU yourself.
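For illustration, here is a short sketch in C++ using the metal-cpp bindings (this assumes metal-cpp is available; the buffer sizes are arbitrary, and the enum spellings follow the metal-cpp headers as I understand them). It requests the two hazard-tracking behaviors explicitly at buffer-creation time: the tracked private buffer is synchronized for you by Metal, while the untracked one would have to be protected yourself with fences or events.

// Exactly one translation unit must define the *_PRIVATE_IMPLEMENTATION macros.
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Foundation/Foundation.hpp>
#include <Metal/Metal.hpp>

int main() {
  MTL::Device* device = MTL::CreateSystemDefaultDevice();

  // GPU-only buffer; hazard tracking is the default, but we state it explicitly.
  MTL::Buffer* tracked = device->newBuffer(
      1024 * sizeof(float),
      MTL::ResourceStorageModePrivate | MTL::ResourceHazardTrackingModeTracked);

  // GPU-only buffer with hazard tracking disabled: the app must order GPU
  // accesses itself (fences within a command buffer, events across them).
  MTL::Buffer* untracked = device->newBuffer(
      1024 * sizeof(float),
      MTL::ResourceStorageModePrivate | MTL::ResourceHazardTrackingModeUntracked);

  untracked->release();
  tracked->release();
  device->release();
  return 0;
}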
EDIT
After reading more about resource synchronization in Metal, I have some additional points that I think further clarify what's going on. Note that the remaining portion is written in terms of the Swift API. To learn more in detail, I recommend reading the "Synchronization" section of the Metal documentation here.
MTLFence
First, an MTLFence is used to synchronize access to untracked resources within the execution of a single command buffer. A fence gives you explicit control over when the GPU accesses resources and is necessary when you are working with an untracked resource. Otherwise, Metal handles this synchronization for you.
It is important to note that the automatic management I mention in the answer only occurs within a single command buffer, between encoding passes. This does not mean we need to synchronize across command buffers scheduled in the same command queue, since a command buffer is not immediately scheduled for execution. In fact, according to the documentation on the addScheduledHandler(_:) method of the MTLCommandBuffer protocol found here:
The device object schedules the command buffer after it identifies any dependencies with work tasks submitted by other command buffers or other APIs in the system.
at which point it would be safe to access these same buffers. Note that within a single render encoding pass, if a vertex shader writes into a buffer that the fragment shader in the same pass reads from, the behavior is undefined. I mentioned this in the original question, the solution being to use two render command encoders. I have yet to determine why this is not necessary for a compute encoder, but I imagine it has to do with how kernels are executed in comparison to vertex and fragment shaders.
MTLEvent
In some cases, however, command buffers in different queues created by the same MTLDevice need access to the same resource or depend on one another in some way. In this case, synchronization is necessary because the separate queues schedule their own command buffers without knowledge of the other, meaning there is potential for the two command buffers to be executing concurrently.
To fix this problem, you use an MTLEvent instance created by the device using makeEvent() and encode event signals at specific points in each buffer.
MTLSharedEvent
In the event (no pun intended) that you have multiple processors (different CPU cores, CPU and GPU, or multiple GPUs), resource synchronization is needed. Here, you create an MTLSharedEvent in place of an MTLEvent; it can be used to synchronize across devices and processes. It is essentially the same API as that of MTLEvent, but involves command queues on different devices.

How does the GPU process non-graphics data in parallel?

The introduction of programmable shaders in the graphics pipeline enabled the GPGPU concept, which uses the GPU as a general processing engine suited to parallel data.
However, as far as I know, because the GPU is still used far more for graphics processing than for GPGPU, it retains many fixed graphics-pipeline stages that cannot be programmed.
If my understanding is correct, any data processed by the GPU, regardless of its type (graphics or general), has to pass through the graphics pipeline, which includes both programmable stages and non-programmable fixed stages.
Does that mean non-graphics processing has to go through the graphics stages even though it doesn't use them, or can it bypass those fixed stages? If someone can explain how the GPU pipeline works for GPGPU, I would appreciate it.
TL;DR:
GPGPU completely bypasses the rendering pipeline, but the pipeline is still used today.
GPUs consist of two main parts (in relation to your question). The first is the processing part, which consists of the memory, registers, warp units, dispatchers and streaming processors. The other part is a set of controllers that are responsible for geometry processing and the graphics pipeline. Those controllers simply issue commands to the streaming processors on how to process the data for each step of the rendering pipeline, either hardwired or based on user-supplied shaders. NVIDIA calls them the "PolyMorph Engine", AMD the "Geometric Processor".
Historically, some of those controllers were hardwired to do things a single way, so you could only program the vertex and fragment (pixel) shaders. The tessellation controller, for example, was hardwired on the GPU and not user-programmable. As demands grew, more and more of those controllers became user-programmable, and today most of them are completely programmable (Wikipedia).
In the early days of GPGPU, the only way to do computing was to repurpose the available shaders: put your input data in a texture, draw it on a full-screen quad to calculate the result, and then read the rendered image back (see slide 26 of this introduction).
With CUDA, NVIDIA allowed users not only to program the shaders/PolyMorph Engine, but also to interact directly with the streaming processors and execute code on them (see slides 31 and 32).
This does not mean that the graphics pipeline became obsolete, but there is now a way to completely bypass it and run code directly on the GPU processors. NVIDIA has a nice explanation of how the pipeline works today, where you can also see both the PolyMorph Engine and the streaming processors, here.
The graphics pipeline still helps the developer by offloading repetitive and more complicated parts of the process, like managing memory, managing warps, and passing data around. Theoretically, you could write your own pipeline directly on the streaming processors using CUDA and then render the result, but it would be tedious, just as writing GPGPU code using shaders is tedious.
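To make "bypassing the pipeline" concrete, here is a hedged sketch using Vulkan's compute path (chosen only as one example API; the answer above is framed around CUDA). The function assumes the command buffer, compute pipeline, pipeline layout, and descriptor set were created elsewhere; note that no render pass, vertex fetch, rasterizer, or other fixed-function graphics stage is involved.

#include <vulkan/vulkan.h>

// Records a pure compute dispatch: the work goes straight to the shader cores
// through a compute pipeline, with no graphics pipeline state at all.
void recordComputeDispatch(VkCommandBuffer cmd,
                           VkPipeline computePipeline,
                           VkPipelineLayout layout,
                           VkDescriptorSet inputsAndOutputs,
                           uint32_t workgroupCountX) {
    VkCommandBufferBeginInfo begin{};
    begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    vkBeginCommandBuffer(cmd, &begin);

    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
                            0, 1, &inputsAndOutputs, 0, nullptr);
    vkCmdDispatch(cmd, workgroupCountX, 1, 1);

    vkEndCommandBuffer(cmd);
}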
Although old GPUs had the pipeline hardcoded in the chip, a modern GPU is just a large ASIC that can process vectorized data extremely fast. It is humans who define what it does: the render pipeline is defined in the graphics library, like OpenGL, not in the GPU. Thus the GPU does not care what it is computing; as long as it is given vectorized data, it can do all the computation needed and give you a result.

Tensorflow Cross Device Communication

As the TensorFlow paper states, TensorFlow's cross-device communication is achieved by adding "receive" and "send" nodes between devices.
From my understanding, the device (please consider only CPU devices here) is responsible for performing the computation of an operation. However, the data (e.g. a tensor produced by an operation, or a variable buffer) resides in memory. I don't know how the transfer of data from one device to another is physically achieved. I guess the data transfer is achieved via shared memory. Is that right?
I would appreciate any explanation of, or pointers to the corresponding code for, how the data transfer is achieved.
PS: Figure 4 in the TensorFlow paper shows the cross-device communication mechanism.
In TensorFlow, cross-device communication is achieved using the Rendezvous interface, which has multiple different implementations, depending on the deployment. The comment on that interface describes the general idea:
// A Rendezvous is an abstraction for passing a Tensor
// from a producer to a consumer, where the consumer may safely
// request the Tensor before or after it has been produced. A
// producer never blocks when using a Rendezvous. A consumer has the
// choice of making a blocking call or providing a callback: in either
// case, the consumer receives the Tensor as soon as it is available.
As you noted in your question, TensorFlow represents communication in the dataflow graph using Send and Recv ops that are added to the graph automatically when the graph is partitioned across devices. For each edge that has a source and destination on different devices, the graph partitioner inserts a pair of Send and Recv ops that share the same "rendezvous key" (an automatically generated string name that is used as a key in the rendezvous' index of pending tensors to be communicated). The implementation of the Send op is simple: it calls Rendezvous::Send(), passing in its rendezvous key and single input tensor, then returns immediately without blocking. The implementation of the Recv op is slightly more complicated: it registers a callback to be called when the tensor with the given key becomes available. That callback is responsible for "producing" the output of the Recv op, and unblocking subsequent computation.
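A minimal, hypothetical sketch of the rendezvous idea described above (this is not TensorFlow's Rendezvous class; a toy tensor type and plain string keys stand in for the real ones): Send never blocks, and RecvAsync either consumes the value immediately or registers a callback that fires when the matching Send arrives.

#include <functional>
#include <iostream>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Tensor = std::vector<float>;                        // toy stand-in
using DoneCallback = std::function<void(const Tensor&)>;

class ToyRendezvous {
 public:
  // Producer side: deposit the tensor under its rendezvous key and return
  // immediately. If a consumer is already waiting, fire its callback now.
  void Send(const std::string& key, Tensor value) {
    DoneCallback waiter;
    {
      std::lock_guard<std::mutex> lock(mu_);
      auto it = waiters_.find(key);
      if (it == waiters_.end()) {
        ready_[key] = std::move(value);
        return;
      }
      waiter = std::move(it->second);
      waiters_.erase(it);
    }
    waiter(value);                                        // run outside the lock
  }

  // Consumer side: if the tensor is already there, consume it; otherwise
  // register a callback to be invoked by the matching Send.
  void RecvAsync(const std::string& key, DoneCallback done) {
    Tensor value;
    {
      std::lock_guard<std::mutex> lock(mu_);
      auto it = ready_.find(key);
      if (it == ready_.end()) {
        waiters_[key] = std::move(done);
        return;
      }
      value = std::move(it->second);
      ready_.erase(it);
    }
    done(value);
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::string, Tensor> ready_;          // sent, not yet received
  std::unordered_map<std::string, DoneCallback> waiters_;  // receive arrived first
};

int main() {
  ToyRendezvous rendezvous;
  // The consumer may register before the producer sends.
  rendezvous.RecvAsync("edge_0;/cpu:0->/cpu:1", [](const Tensor& t) {
    std::cout << "received " << t.size() << " floats\n";
  });
  rendezvous.Send("edge_0;/cpu:0->/cpu:1", Tensor(128, 0.0f));
}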
The Rendezvous implementations perform the actual work of transferring the data:
IntraProcessRendezvous handles the transfer of data between devices in the same process. In the (unlikely) event that the transfer is between two CPU devices in the same process, the transfer can be achieved by a simple Tensor assignment. Otherwise, TensorFlow kicks off a device-specific DMA routine to transfer data between a CPU and GPU device.
The BaseRemoteRendezvous class and its subclasses handle cross-device communication in the case where the sender and receiver can be in different processes. The main implementation of this class is RpcRemoteRendezvous, which uses gRPC to handle the remote transfers.

Tensorflow Device vs DeviceContext

I am implementing a new Device for TensorFlow. I would like some clarification of the relationship between the Device and the DeviceContext. I have read this question, but I think I need a bit more information.
Should it be that each device in my system has one Device instance, with the device instance maintaining information about that physical device? Then the DeviceContext should be maintaining runtime information about this device.
In the other question, the answers state that the GPU device maintains several device contexts, one for each stream, with streams given particular jobs (copying vs. executing). It sounds like kernel ops are bound to specific device contexts; if so, when/where does that occur?
Since the GPUDevice has multiple contexts per device, I would argue that you do not need to have one context per device. As such, I would agree that the device class should contain minimal data about the actual hardware, while the device context behaves more as a runtime controller of the device (handling memory allocation, data transfer, execution, etc.), judging by the names of the functions in the context.
The binding of kernel ops to contexts on GPUs occurs in FillContextMap, where the computation-graph nodes are attached to device contexts.
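As a hedged illustration of that binding step (the types below are simplified stand-ins, not TensorFlow's actual Graph or DeviceContextMap API): during graph initialization the device walks the nodes assigned to it and records which device context, and therefore which stream, each node will use; kernels later look their context up by node id.

#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for graph nodes and device contexts.
struct Node { int id; std::string op; };                 // e.g. op == "MatMul" or "MemcpyH2D"
struct DeviceContext { int stream_id; };

// Maps each node assigned to this device to the context whose stream it runs on.
using DeviceContextMap = std::unordered_map<int, DeviceContext*>;

void FillContextMapSketch(const std::vector<Node>& nodes,
                          DeviceContext* compute_ctx,
                          DeviceContext* copy_ctx,
                          DeviceContextMap* ctx_map) {
  for (const Node& n : nodes) {
    // Copy-like ops get the copy stream's context, everything else the compute one.
    const bool is_copy = n.op.rfind("Memcpy", 0) == 0;
    (*ctx_map)[n.id] = is_copy ? copy_ctx : compute_ctx;
  }
}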

How would multi-GPU programming work with Vulkan?

Would using multiple GPUs in Vulkan be something like making many command queues and then dividing command buffers between them?
There are two problems:
In OpenGL, we use GLEW to get functions. With more than one GPU, each GPU has its own driver. How would we handle this in Vulkan?
Would part of the frame be generated with one GPU and the rest with other GPUs, for example using the Intel GPU to render the UI and an AMD or NVIDIA GPU to render the game screen on laptops? Or would one frame be generated on one GPU and the next frame on another?
Updated with more recent information, now that Vulkan exists.
There are two kinds of multi-GPU setups: where multiple GPUs are part of some SLI-style setup, and the kind where they are not. Vulkan supports both, and supports them both in the same computer. That is, you can have two NVIDIA GPUs that are SLI-ed together, and the Intel embedded GPU, and Vulkan can interact with them all.
Non-SLI setups
In Vulkan, there is something called the Vulkan instance. This represents the base Vulkan system itself; individual devices register themselves to the instance. The Vulkan instance system is, essentially, implemented by the Vulkan SDK.
Physical devices represent a specific piece of hardware that implements the interface to a GPU. Each piece of hardware that exposes a Vulkan implementation does so by registering its physical device with the instance system. You can query which physical devices are available, as well as some basic properties about them (their names, how much memory they offer, etc).
You then create logical devices for the physical devices you use. Logical devices are how you actually do stuff in Vulkan. They have queues, command buffers, etc. And each logical device is separate... mostly.
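A condensed sketch of that flow using the Vulkan C API (error handling, extension and feature selection are omitted, and queue family 0 is just a placeholder assumption; a real application queries the queue families first):

#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    // 1. Create the instance: the entry point that all physical devices register with.
    VkApplicationInfo appInfo{};
    appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    appInfo.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo instanceInfo{};
    instanceInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    instanceInfo.pApplicationInfo = &appInfo;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&instanceInfo, nullptr, &instance) != VK_SUCCESS) return 1;

    // 2. Query every physical device (each GPU exposing a Vulkan driver shows up here).
    uint32_t gpuCount = 0;
    vkEnumeratePhysicalDevices(instance, &gpuCount, nullptr);
    std::vector<VkPhysicalDevice> gpus(gpuCount);
    vkEnumeratePhysicalDevices(instance, &gpuCount, gpus.data());

    for (VkPhysicalDevice gpu : gpus) {
        VkPhysicalDeviceProperties props{};
        vkGetPhysicalDeviceProperties(gpu, &props);
        std::printf("found GPU: %s\n", props.deviceName);
    }

    // 3. Create a logical device for the physical device you want to use.
    float priority = 1.0f;
    VkDeviceQueueCreateInfo queueInfo{};
    queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfo.queueFamilyIndex = 0;   // placeholder: pick a suitable family in real code
    queueInfo.queueCount = 1;
    queueInfo.pQueuePriorities = &priority;

    VkDeviceCreateInfo deviceInfo{};
    deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 1;
    deviceInfo.pQueueCreateInfos = &queueInfo;

    VkDevice device = VK_NULL_HANDLE;
    if (gpuCount > 0) vkCreateDevice(gpus[0], &deviceInfo, nullptr, &device);

    if (device != VK_NULL_HANDLE) vkDestroyDevice(device, nullptr);
    vkDestroyInstance(instance, nullptr);
    return 0;
}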
Now, you can bypass the whole "instance" thing and load devices manually. But you really shouldn't. At least, not unless you're at the end of development. Vulkan layers are far too critical for day-to-day debugging to just opt out of that.
There are mechanisms, core in Vulkan 1.1, that allow individual devices to communicate some information to other devices. In 1.1, only certain kinds of information can be shared across physical devices (namely fences and semaphores, and even then only on Linux through sync files). While these APIs could provide a mechanism for sharing data between two physical devices, at present the restriction on most forms of data sharing is that both physical devices must have matching UUIDs (and therefore be the same physical device).
SLI setups
Dealing with SLI is covered by two Vulkan 1.0 extensions: KHR_device_group and KHR_device_group_creation. The former is for dealing with "device groups" in Vulkan, while the latter is an instance extension for creating device-grouped devices. Both of these are core in Vulkan 1.1.
The idea with this is that the SLI aggregation is exposed as a single VkDevice, which is created from a number of VkPhysicalDevices. Each internal physical device is a "sub-device". You can query sub-devices and some properties about them. Memory allocations are specific to a particular sub-device. Resource objects (buffers and images) are not specific to a sub-device, but they can be associated with different memory allocations on the different sub-devices.
Command buffers and queues are not specific to sub-devices; when you execute a CB on a queue, the driver figures out which sub-device(s) it will run on, and fills in the descriptors that use the images/buffers with the proper GPU pointers for the memory that those images/buffers have been bound to on those particular sub-devices.
Alternate-frame rendering is simply presenting images generated from one sub-device on one frame, then presenting images from a different sub-device on another frame. Split-frame rendering is handled by a more complex mechanism, where you define the memory for the destination image of a rendering command to be split among devices. You can even do this with presentable images.
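Here is a short sketch of the device-group path in core Vulkan 1.1 (it assumes an existing VkInstance created for API version 1.1, and again uses queue family 0 as a placeholder): enumerate the device groups, then chain a VkDeviceGroupDeviceCreateInfo into device creation so the resulting VkDevice spans all of the group's sub-devices.

#include <vulkan/vulkan.h>
#include <vector>

// Creates one logical device spanning every physical device in the first
// device group reported by the driver (e.g. an SLI/CrossFire pair).
VkDevice createDeviceFromFirstGroup(VkInstance instance) {
    uint32_t groupCount = 0;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, nullptr);
    if (groupCount == 0) return VK_NULL_HANDLE;

    std::vector<VkPhysicalDeviceGroupProperties> groups(groupCount);
    for (auto& g : groups) g.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_GROUP_PROPERTIES;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, groups.data());

    const VkPhysicalDeviceGroupProperties& group = groups[0];

    // Chain the group description into device creation: the resulting VkDevice
    // represents all sub-devices in the group.
    VkDeviceGroupDeviceCreateInfo groupInfo{};
    groupInfo.sType = VK_STRUCTURE_TYPE_DEVICE_GROUP_DEVICE_CREATE_INFO;
    groupInfo.physicalDeviceCount = group.physicalDeviceCount;
    groupInfo.pPhysicalDevices = group.physicalDevices;

    float priority = 1.0f;
    VkDeviceQueueCreateInfo queueInfo{};
    queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfo.queueFamilyIndex = 0;   // placeholder: query queue families in real code
    queueInfo.queueCount = 1;
    queueInfo.pQueuePriorities = &priority;

    VkDeviceCreateInfo deviceInfo{};
    deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.pNext = &groupInfo;
    deviceInfo.queueCreateInfoCount = 1;
    deviceInfo.pQueueCreateInfos = &queueInfo;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(group.physicalDevices[0], &deviceInfo, nullptr, &device);
    return device;
}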
In Vulkan you need to enumerate the devices and select the one you want to work with. Nothing stops you from working with two different ones separately. Each Vulkan call takes at least one parameter as context, and the loader layer then forwards the call to the correct driver. Alternatively, you can load the functions for each device separately to avoid the loader's trampoline.
A generated frame will need to be forwarded to the card that is connected to the screen for display. So it is more likely that one GPU is responsible for graphics and the others are used for physics.
Only a single device can be connected to a specific surface at a time, so that device needs to receive the rendered frame and copy it into the presentable image that gets pushed to the screen.
Device groups are the way to go; look at the Vulkan specification for documentation. Vulkan handles all the dispatch to the other GPUs (when they are connected by SLI/CrossFire). All you need to do is tell Vulkan how the dispatch is done (for example, dispatch one frame on one GPU and the next on another). If you need to do compute work, you will need to address each GPU individually. For reference: https://www.ea.com/seed/news/khronos-munich-2018-halcyon-vulkan