Tensorflow Cross Device Communication - tensorflow

As the tensorflow paper states, Tensorflow' cross-device communication is achieved by adding "receive node" and "send node" into devices.
From my understanding, the device(Please considering only CPU devices are involved) is responsible for performing the computation of an operation. However,the data(ex:Tensor produced from an operation, Variable buffer) resides in memory. I don't know how data transfer from one device to another device is achieved physically. I guess the data transfer is achieved by shared memory. Is that right?
I will appreciate any explanation/corresponding codes regarding how the data transfer is achieved.
PS: TensorFlow paper link, Figure 4 shows the cross-device communication mechanism.

In TensorFlow, cross-device communication is achieved using the Rendezvous interface, which has multiple different implementations, depending on the deployment. The comment on that interface describes the general idea:
// A Rendezvous is an abstraction for passing a Tensor
// from a producer to a consumer, where the consumer may safely
// request the Tensor before or after it has been produced. A
// producer never blocks when using a Rendezvous. A consumer has the
// choice of making a blocking call or providing a callback: in either
// case, the consumer receives the Tensor as soon as it is available.
As you noted in your question, TensorFlow represents communication in the dataflow graph using Send and Recv ops that are added to the graph automatically when the graph is partitioned across devices. For each edge that has a source and destination on different devices, the graph partitioner inserts a pair of Send and Recv ops that share the same "rendezvous key" (an automatically generated string name that is used as a key in the rendezvous' index of pending tensors to be communicated). The implementation of the Send op is simple: it calls Rendezvous::Send(), passing in its rendezvous key and single input tensor, then returns immediately without blocking. The implementation of the Recv op is slightly more complicated: it registers a callback to be called when the tensor with the given key becomes available. That callback is responsible for "producing" the output of the Recv op, and unblocking subsequent computation.
The Rendezvous implementations perform the actual work of transferring the data:
IntraProcessRendezvous handles the transfer of data between devices in the same process. In the (unlikely) event that the transfer is between two CPU devices in the same process, the transfer can be achieved by a simple Tensor assignment. Otherwise, TensorFlow kicks off a device-specific DMA routine to transfer data between a CPU and GPU device.
The BaseRemoteRendezvous class and its subclasses handle cross-device communication in the case that the send and receiver can be in different processes. The main implementation of this class is RpcRemoteRendezvous, which uses gRPC to handle the remote transfers.

Related

Rendering Terrain Dynamically with Argument Buffers : Understanding why the particle buffer is not overwritten by the GPU inflight

I am looking through an Apple demo project that is associated with the 2017 WWDC video entitled "Introducing Metal 2" where the developers demonstrate the use of argument buffers. The project is linked here on the page titled "Rendering Terrain Dynamically with Argument Buffers" on the Apple developer website. Here, they synchronize resource writes by the CPU to prevent race conditions with a dispatch_semaphore_t, signaling it when the command buffer finishes executing on the GPU and waiting on it if the CPU is writing data several frames ahead of the GPU. This is consistent with what was shown in a previous 2014 WWDC "Working With Metal: Fundamentals".
I noticed that it seems the APPLParticleRenderer is sending data to be written by the GPU in a compute pass before it finishes reading from that same buffer from the fragment shader from a previous render pass. The resource storage mode of the buffer is MTLResourceStorageModePrivate. My question: does Metal automatically synchronize access to private id<MTLBuffer>s accessible only by the GPU? Do render, compute, and blit passes called from new id<MTLCommandEncoder> have access to the buffer only after other passes have written and read from it (exclusive access)? I have seen that there are guaranteed barriers within tile shaders, where tile memory is accessed exclusively by the kernel before subsequent fragment shaders access the memory.
Lastly, in the 2016 WWDC "What's New in Metal, Part 2", the first presenter, Charles Brissart, at 16:44 mentions that fragment and vertex functions reading and writing from the same buffer must be placed into two render command encoders, but for compute kernels one compute command encoder suffices. This is consistent with what is seen within the particle renderer.
See my comment on the original question for a brief version of this answer.
It turns out that Metal tracks dependencies between commands scheduled to the GPU by default for MTLResource types. The hazardTrackingMode property of a MTLResource is defaulted to MTLHazardTrackingModeTracked (MTLHazardTrackingMode.tracked in Swift) according to the Metal documentation. This means Metal tracks dependencies across commands that modify the resource, as is the case with the particle kernel, and delays execution until prior commands accessing the resource are complete.
Therefore, since the _particleDataPool buffer has a storage mode of MTLResourceStorageModePrivate (storageModePrivate in Swift), it can only be written to by the GPU; hence, no CPU/GPU synchronization is necessary with a semaphore for this buffer and thus no multi-buffer system is necessary for the resource.
Only when a resource can be written to by the CPU while the GPU is still reading from it do we want multiple buffers so the CPU is not idle.
Note that the default hazard tracking mode for a MTLHeap is MTLHazardTrackingModeUntracked (MTLHazardTrackingMode.untracked in Swift), in which case you are responsible for synchronizing resource writes by the GPU
EDIT
After reading into resource synchronization in Metal, there are some additional points I would like to make that I think further clarify what's going on. Note that the remaining portion is in Swift. To learn more in detail, I recommend reading the "Synchronization" section in the Metal documentation here.
MTLFence
Firstly, a MTLFence is used to synchronize accesses to untracked resources within the execution of a single command buffer. A fence gives you explicit control over when the GPU accesses resources and is necessary when you are working with an untracked resource. Otherwise, Metal will handle this synchronization for you
It is important to note that the automatic management I mention in the answer only occurs within a single command buffer between encoding passes. But this does not mean we need to synchronize across command buffers scheduled in the same command queue since a command buffer is not immediately scheduled for execution. In fact, according to the documentation on the addScheduledHandler(_:) method of the MTLCommandBuffer protocol found here
The device object schedules the command buffer after it identifies any dependencies with work tasks submitted by other command buffers or other APIs in the system.
at which point it would be safe to access these same buffers. Note that within a single render encoding pass, it is important to mention that if a vertex shader writes into a buffer the fragment shader in the same pass reads from, this is undefined. I mentioned this in the original question, the solution being to use two render pass encoders. I have yet to determine why this is not necessary for a compute encoder, but I imagine it has to do with how kernels are executed in comparison to vertex and fragment shaders
MTLEvent
In some cases, however, command buffers in different queues created by the same MTLDevice need access to the same resource or depend on one another in some way. In this case, synchronization is necessary because the separate queues schedule their own command buffers without knowledge of the other, meaning there is potential for the two command buffers to be executing concurrently.
To fix this problem, you use an MTLEvent instance created by the device using makeEvent() and encode event signals at specific points in each buffer.
MTLSharedEvent
In the event (no pun intended) that you have multiple processors (different CPU cores, CPU and GPU, or multi-GPU), resource synchronization is needed. Here, you create a MTLSharedEvent in place of a MTLEvent that can be used to synchronize across devices and processes. It is essentially the same API as that of the MTLEvent, but involves command queues on different devices.

Can I do transfer operation in transfer queue and graphics queue at the same time?

I have made 2 instances of VkQueue: one from graphics family and another one from transfer family. Command pools and command buffers are separated accordingly. Both are doing transfer operations.
Purpose of first one except rendering is to update uniform buffers on
each frame.
Purpose of second one is to update resources: model
vertex/index buffers, texture images etc.
They work in parallel in different threads asynchronously. So it is possible that there will be 2 calls of vkQueueSubmit at the same time.
Is such usage allowed and is it safe?
Note: once I have multithreaded my program sometimes I have VK_DEVICE_LOST on vkQueueSumbit and it is likely that it happens more frequently when resources are loading, that is why I actually came to this question
The Vulkan specification is pretty clear about CPU synchronization of Vulkan functions. vkQueueSubmit says:
Host access to queue must be externally synchronized
Where "queue" is the parameter passed to vkQueueSubmit. It doesn't say every queue; it says "that queue".
And if "external synchronization" is not specifically stated as a requirement of a command, then it isn't a requirement of that command.

Tensorflow Device vs DeviceContext

I am implementing a new Device for Tensorflow. I would like some clarification between the Device and the DeviceContext. I have read this question but I think I need a bit more info.
Should it be that each device in my system has one Device instance, with the device instance maintaining information about that physical device? Then the DeviceContext should be maintaining runtime information about this device.
In the other question, the answers state that the GPU device maintains several device contexts, one for each stream, with streams given particular jobs (copying vs executing). It sounds like the kernel ops bound to specific device contexts, and if so, when/where does that occur?
Since the GPUDevice has multiple contexts per device, I would argue that you do not need to have one context per device. As such, I would agree that the device class would contain minimal data about the actual hardware which the device context would behave as more of a runtime control of the device (handling memory allocation, data transfer, execution, etc.) judging by the names of the functions in the context
The binding of kernel ops to contexts in GPUs occurs in the FillContextMap where the computation graph nodes are attached to device contexts.

TensorFlow Device Contexts, Streams and Context Switching

In the GPUDevice code, I noticed that one GPUDeviceContext is made per stream.
Is the purpose of this so that every context can control one OpKernelContext and then as the various streams need to be executed, then the contexts can just be switched which handles pushing different data/code onto the GPU and then executing.
Do the various streams get registered as different devices (ie. '/gpu:0' and '/gpu:1')?
Per this, ThreadPoolDevice's don't have contexts, but if I were to add contexts into ThreadPoolDevice, would they fit best as a sort of ThreadContext?
For GPU, we maintain a few streams for execution: a compute stream (on which most computational kernels run), and some memcopy streams (for executing memcopies between host and device and vice versa). This is done to overlap communication and computation on GPU devices, but is particular to the way that we use GPUs. One could easily also just create one GPU stream for all computation and communication and it would be correct, although slower.
We want to give the computation stream to kernels that do computation, and the memcopy stream to the kernels that do copying. We create a GPUDeviceContext object for each stream, and then pass the right device context object to the OpKernelContext.
So the particular implementations here reflect the properties of the asynchronous hardware device (GPU), which is why the ThreadPoolDevice doesn't have these sorts of mechanisms. On CPU all computation is synchronous, so there is no need for an abstraction such as streams.
The execution model of the custom hardware will likely determine what kind of state and management a custom device support will require in TensorFlow.

Microcontroller to microcontroller communication library (over UART/RS232)

I want to interface two microcontrollers with a UART interface and I search a protocol to exchange data between them.
In practice, I want to exchange data periodically (ie: sensors reading) and also data on event (GPIO state). I have around 100-200 bytes to exchange every 100 milli second.
Does anybody know a protocol or library to achieve this kind of task ?
For now, I see protobuf and nano protobuff ? Is there something else ?
It would be nice if I could add a software layer over the UART and use "virtual data stream" like if it was a TCP/IP connection to N ports.
Any idea ?
Thanks
I think the most straight forward way is to roll your own.
You'll find RS232 drivers in the manufacturers chip support library.
RS232 is a stream oriented transport, that means you will need to encode your messages into some frameing structure when you send them and detect frame boundaries on the receiver side. A clever and easy to use mechanism to do this is "Consistent Overhead Byte Stuffing".
https://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffing
This simple algorithm turns zeros in your messages into some other value, so the zero-byte can be used to detect start and end of frame. If a byte gets corrupted on the way you can even resynchronize to the stream and keep going.
The code on Wikipedia should be easy enough even for the smallest micro-processors.
Afterwards you can define your message format. You can probably keep it very simple and directly send your data-structures as is.
Suggestion for a simple message format:
Byte-ID Meaning
---------------------------------
0 Destination port number
1 message type (define your own)
2 to n message data
If you want to send variable length messages you can either send out a length byte or derive the length from the output of the Constant Overhead Byte Stuffing framing.
By the way, UART/RS232 is nice and easy to work with, but you may also want to take a look at SPI. The SPI interface is more suitable to exchange data between two micro-controllers. It is usually faster than RS232 and more robust because it has a dedicated clock-line.
How about this: eRPC https://community.nxp.com/docs/DOC-334083
The eRPC (Embedded Remote Procedure Call) is a Remote Procedure Call (RPC) system created by NXP. An RPC is a mechanism used to invoke a software routine on a remote system using a simple local function call. The remote system may be any CPU connected by an arbitrary communications channel: a server across a network, another CPU core in a multicore system, and so on. To the client, it is just like calling a function in a library built into the application. The only difference is any latency or unreliability introduced by the communications channel.
I have use it in a two processor embedded system, a cortext-A9 CPU with a Context-M4 MCU, which communicate each other with SPI/GPIO.
Erpc can run over UART, SPI, rpmsg and network(tcp). even when using serial or SPI as transport tunnel, it can do bidirectional
calls and with very minimal footprint.
Simple serial point-to-point communication protocol
http://www.zipplet.co.uk/index.php/content/openformats_mise
It depends if you need master/slave implementation, noise protection, point-point or multi-point (and in this case collision detection), etc
but, as our colleague said, I would go with the simplest solution that fits the problem, following the KISS principle http://en.wikipedia.org/wiki/KISS_principle
Just add some header information like ID and length, if necessary CRC checking, and be happy :)
Try Microcontroller Interconnect Network (MIN) 1.0:
https://github.com/min-protocol/min
It has framing using byte-stuffing to keep receiver sync, 16-bit Fletcher's algorithm for checksum, an identifier for use by the application and a variable payload of up to 15 bytes.
There's embedded C code there plus also a Python implementation to make it easier to talk to a PC.
As the first answer starts, the simplest result is to roll your own. Define your header (the "format" above) as needed, perhaps including status information so each processor knows that the other is working properly. I have had success with a protocol that includes
2 byte ascii prefix and suffix such as "[" and "]" so that a
protocol analyzer can show you message boundaries.
The number of bytes.
The command ID (parsed to indicate what command handler to use.
Command arguments (I used 3 32 bit words).
A CRC or checksum to verify transfer integrity
The parser then recognizes the [* as the start of the message, and dispatches the body to the command handler for the particular command ID with the associated arguments as long as the checksum matches.