Some TensorFlow GPU OpKernels compute via the Eigen device without a stream sync. Is that buggy?

From gpu_device.cc:
// NOTE(tucker): We need to discriminate between Eigen GPU
// operations and all others. If an operation is Eigen
// implemented (or otherwise tries to launch a cuda kernel
// directly), we need to establish a stacked-scoped environment
// that directs it to execute on the proper device. Otherwise we
// expect the Op to use StreamExecutor directly and correctly. The
// way we make this discrimination is quite hacky: At the moment
// the only non-Eigen GPU Op is the recv-op, which is known to be
// asynchronous.
And gpu_device only waits when the context is different (sync_every_op is false).
But in argmax_op.h, for example,
template <typename Device, typename T>
struct ArgMin {
#define DECLARE_COMPUTE_SPEC(Dims)                                      \
  EIGEN_ALWAYS_INLINE static void Reduce##Dims(                         \
      const Device& d, typename TTypes<T, Dims>::ConstTensor input,     \
      const int32 dimension,                                            \
      typename TTypes<int64, Dims - 1>::Tensor output) {                \
    output.device(d) = input.argmin(dimension).template cast<int64>();  \
  }
it uses the Eigen device to compute directly. Is that correct?

I missed something: the CUDA stream is passed to the Eigen device, so there's no problem.
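For illustration, here is a minimal standalone sketch (not TensorFlow's actual plumbing, which lives in gpu_device.cc) of how an Eigen::GpuDevice can be bound to an existing CUDA stream, so that expressions evaluated with .device(d) launch on that stream rather than on the default one. It assumes Eigen's CudaStreamDevice wrapper from the unsupported Tensor module (newer Eigen versions name it GpuStreamDevice), must be compiled with nvcc, and the function and pointer names are made up for the example.

#define EIGEN_USE_GPU
#include <cuda_runtime.h>
#include <unsupported/Eigen/CXX11/Tensor>

// Launch an Eigen tensor expression on a caller-supplied CUDA stream.
// d_in and d_out are device pointers holding n floats.
void scale_on_stream(cudaStream_t stream, float* d_in, float* d_out, int n) {
    Eigen::CudaStreamDevice stream_device(&stream);  // wrap the existing stream
    Eigen::GpuDevice gpu_device(&stream_device);     // Eigen device bound to it

    Eigen::TensorMap<Eigen::Tensor<float, 1>> in(d_in, n);
    Eigen::TensorMap<Eigen::Tensor<float, 1>> out(d_out, n);

    // Asynchronous with respect to the host; the launch is ordered on `stream`,
    // so it serializes correctly with other work enqueued on the same stream.
    out.device(gpu_device) = in * 2.0f;
}

Because the argmin expression above is evaluated through exactly such a stream-bound device, it is ordered on the op's compute stream and no extra synchronization is needed.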

Related

Buffers in the CCL code samples shipped with the oneAPI toolkit

I was going through the CCL code samples shipped with the oneAPI toolkit.
In the DPC++ (SYCL) code below, a buffer sendbuf is initially created on the CPU side and is not initialised. In the part where offloading to the target device takes place, dev_acc_sbuf[id], a variable in kernel scope, is modified. This variable (dev_acc_sbuf) is not used afterwards in the program, nor is its value copied back to sendbuf. Then, in the next line, the sendbuf variable is used for the allreduce. I am not able to understand how changing dev_acc_sbuf makes a change in sendbuf.
cl::sycl::queue q;
cl::sycl::buffer<int, 1> sendbuf(COUNT);

/* open sendbuf and modify it on the target device side */
q.submit([&](cl::sycl::handler& cgh) {
    auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh);
    cgh.parallel_for<class allreduce_test_sbuf_modify>(range<1>{COUNT}, [=](item<1> id) {
        dev_acc_sbuf[id] += 1;
    });
});

/* invoke ccl_allreduce on the CPU side */
ccl_allreduce(&sendbuf,
              &recvbuf,
              COUNT,
              ccl_dtype_int,
              ccl_reduction_sum,
              NULL,
              NULL,
              stream,
              &request);
In the line "auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh);" the dev_acc_sbuf is a handle that accesses sendbuf and not a seperate buffer. The changes made in the dev_acc_sbuf handle gets reflected to the original buffer ie the sendbuffer . This is an advantage in SYCL as the changes made in the kernel scope is automatically copied back to the original variable
On most systems, the host and the device do not share physical memory, the CPU might use RAM and the GPU might use its own global memory. SYCL needs to know which data it will be sharing between the host and the devices.
For this purpose, SYCL uses its buffers, the buffer class is generic over the element type and the number of dimensions. When passed a raw pointer, the buffer(T* ptr, range size) constructor takes ownership of the memory it has been passed. This means that we absolutely cannot use that memory ourselves while the buffer exists, which is why we begin a C++ scope. At the end of their scope, the buffers will be destroyed and the memory returned to the user. A size argument is a range object, which has to have the same number of dimensions as the buffer and is initialized with the number of elements in each dimension. Here, we have one dimension with one element.
Buffers are not associated with a particular queue or context, so they are capable of handling data transparently between multiple devices.
Accessors are used to request access to, and control over, the device memory held by the buffer objects. Their access modes take care of the data movement between host and device, so we do not have to copy the result back from device to host explicitly.
Below is the example for more clarification:
#include <bits/stdc++.h>
#include <CL/sycl.hpp>
using namespace std;

class vector_addition;

int main(int, char**) {
    // creating host memory
    int *a = (int *)malloc(10 * sizeof(int));
    int *b = (int *)malloc(10 * sizeof(int));
    int *c = (int *)malloc(10 * sizeof(int));
    for (int i = 0; i < 10; i++) {
        a[i] = i;
        b[i] = 10 - i;
    }

    cl::sycl::default_selector device_selector;
    cl::sycl::queue queue(device_selector);
    std::cout << "Running on "
              << queue.get_device().get_info<cl::sycl::info::device::name>()
              << "\n";
    {
        // creating buffers from pointers to host memory
        cl::sycl::buffer<int, 1> a_sycl{ a, cl::sycl::range<1>{10} };
        cl::sycl::buffer<int, 1> b_sycl{ b, cl::sycl::range<1>{10} };
        cl::sycl::buffer<int, 1> c_sycl{ c, cl::sycl::range<1>{10} };

        queue.submit([&] (cl::sycl::handler& cgh) {
            // creating accessors of the buffers with the proper mode
            auto a_acc = a_sycl.get_access<cl::sycl::access::mode::read>(cgh);
            auto b_acc = b_sycl.get_access<cl::sycl::access::mode::read>(cgh);
            auto c_acc = c_sycl.get_access<cl::sycl::access::mode::write>(cgh); // responsible for copying back to host memory
            // kernel for execution
            cgh.parallel_for<class vector_addition>(cl::sycl::range<1>{ 10 }, [=](cl::sycl::id<1> idx) {
                c_acc[idx] = a_acc[idx] + b_acc[idx];
            });
        });
    }

    for (int i = 0; i < 10; i++) {
        cout << c[i] << " ";
    }
    cout << "\n";
    return 0;
}
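Note that when the inner scope ends, the buffers are destroyed and c_sycl writes its contents back to the host pointer c, so the final loop should print 10 ten times (a[i] + b[i] = i + (10 - i) = 10).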

How to get Packet Processing(packet_in, flow_match, output) time in OVS switch?

I'm trying to evaluate a routing technique I implemented with Mininet, Open vSwitch and the Ryu controller. But currently I'm unable to figure out how to measure packet processing time within the switch. I can measure the probe message processing time because a packet_in occurs for those packets and they are reported back to the controller program. But how do I measure the processing time for packets whose presence is not reported back to the controller by the switch (no packet_in occurs)? Probably the ovs-ofctl command has some options that can report that time, but I'm still not sure how to do that. Please help me in this circumstance; I have not found enough resources on the internet. Thanks in advance for your help.
As long as you're using the kernel datapath of Open vSwitch, you should be able to retrieve the processing delay for each packet using the usual Linux tracing toolkits.
Below is an example using the BPF infrastructure (requires Linux v4.4+) and the bcc toolkit (I have version 0.5.0-1). Note, however, that for high packet rates, the overhead from running this tool may be significant. Another way to measure the overhead your modifications add is to measure the maximum throughput the switch can achieve with and without your modifications.
#!/usr/bin/env python
from bcc import BPF
import sys
import ctypes as ct

prog = """
#include <uapi/linux/ptrace.h>
#include <linux/openvswitch.h>

struct vport;

enum action_t {
    DROP = 0,
    OUTPUT,
};

struct proc_record_t {
    u64 delay;
    enum action_t action;
};

BPF_HASH(pkts, struct sk_buff *, u64, 1024);
BPF_PERF_OUTPUT(events);

// Take a timestamp at packet reception by Open vSwitch.
int kprobe__ovs_vport_receive(struct pt_regs *ctx, struct vport *port,
                              struct sk_buff *skb) {
    u64 ts = bpf_ktime_get_ns();
    pkts.update(&skb, &ts);
    return 0;
}

// Once the packet has been processed by the switch, measure the processing
// delay and send it to userspace using perf_submit.
static inline void end_processing(struct pt_regs *ctx, struct sk_buff *skb,
                                  enum action_t action) {
    u64 *tsp = pkts.lookup(&skb);
    if (tsp) {
        u64 ts = bpf_ktime_get_ns();
        struct proc_record_t record = {};
        record.delay = ts - *tsp;
        record.action = action;
        events.perf_submit(ctx, &record, sizeof(record));
        pkts.delete(&skb);
    }
}

// Called when packets are dropped by Open vSwitch.
int kprobe__consume_skb(struct pt_regs *ctx, struct sk_buff *skb) {
    end_processing(ctx, skb, DROP);
    return 0;
}

// Called when packets are outputted by Open vSwitch.
int kprobe__ovs_vport_send(struct pt_regs *ctx, struct vport *vport,
                           struct sk_buff *skb) {
    end_processing(ctx, skb, OUTPUT);
    return 0;
}
"""
b = BPF(text=prog)

class Data(ct.Structure):
    _fields_ = [("delay", ct.c_ulonglong),
                ("action", ct.c_int)]

actions = ["drop", "output"]

print("%-18s %s" % ("DELAY(ns)", "ACTION"))

# Callback function to display information from the kernel
def print_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(Data)).contents
    print("%-18d %s" % (event.delay, actions[event.action]))

b["events"].open_perf_buffer(print_event)
while True:
    b.kprobe_poll()
You'll need to install bcc to execute this script. Then, it's as simple as:
$ sudo python trace_processing_time.py
DELAY(ns) ACTION
97385 drop
55630 drop
38768 drop
61113 drop
10382 output
14795 output
See the bcc documentation for details on how this script works. You will need to change it if you want to support more OpenFlow actions (only drop and output currently).

Tensorflow new Op CUDA kernel memory management

I have implemented a rather complex new Op in TensorFlow with a GPU CUDA kernel.
This Op requires a lot of dynamic allocation of variables which are not tensors and are deallocated after the Op is done; more specifically, it involves using a hash table.
Right now I am using cudaMalloc() and cudaFree(), but I have noticed TensorFlow has its own type called Eigen::GPUDevice, which has the ability to allocate and deallocate memory on the GPU.
My questions:
Is it best practice to use Eigen::GPUDevice to manage GPU memory?
By using Eigen::GPUDevice instead of the CUDA API, am I "automatically" enabling multi-GPU support, since different GPUDevices can be passed to the Op?
Should I extend this idea to the CPU kernel and see if there is a CPUDevice type which also manages the memory, instead of using plain C++ (i.e. auto var = new int[100]; delete[] var)?
There is no direct public guideline for this issue. I usually just let TensorFlow allocate this memory:
template<typename Device, typename Dtype>
class MyOp : public OpKernel {
public:
    explicit MyOp(OpKernelConstruction *context) :
        OpKernel(context)
    {
        // ...
    }

    void Compute(OpKernelContext *ctx) override
    {
        Tensor tmp_var;
        Tensor* output = nullptr;
        TensorShape some_shape, some_shape2;
        // temporarily use this space
        OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
        // allocate memory for the output tensor
        OP_REQUIRES_OK(ctx, ctx->allocate_output(0, some_shape2, &output));
        // ...
    }
};
Whatever needs memory should be allocated by the TensorFlow context and not by custom cudaMalloc or new type[num] calls; the context provides the appropriate Allocator for the device. See below.
Consider, for the sake of simplicity, just adding two matrices (full example).
TensorFlow operations usually contain the following structure:
the Op description, via REGISTER_OP, which is responsible for shape checking and setting the output shape (example; a minimal sketch follows after the functor code below)
the OpKernel, responsible for allocating memory, getting pointers to the inputs and general setup (see above or this)
the Functor for the implementation itself, like:
Tensor* output = nullptr;
Tensor tmp_var;
OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<Dtype>::value, some_shape, &tmp_var));
// the functor does not need to care about memory allocation, as everything is already set up at this point
::tensorflow::functor::MyFunctor<Device, Dtype>()(ctx, inputA, inputB, tmp_var, output);
You are then just left with implementing:
// gpu version
template <typename Dtype>
struct MyFunctor<GPUDevice, Dtype> {
    void operator()(::tensorflow::OpKernelContext* ctx, ...);
};

// cpu version
template <typename Dtype>
struct MyFunctor<CPUDevice, Dtype> {
    void operator()(::tensorflow::OpKernelContext* ctx, ...);
};
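For completeness, here is a minimal sketch of the REGISTER_OP part mentioned in the list above. It assumes a hypothetical op named "MyOp", the TF 1.x shape-inference API, and the usual CPUDevice/GPUDevice typedefs for the Eigen devices; adapt the attrs and the shape function to your actual op.

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"

using namespace tensorflow;

REGISTER_OP("MyOp")
    .Attr("T: {float, double}")
    .Input("input_a: T")
    .Input("input_b: T")
    .Output("output: T")
    .SetShapeFn([](shape_inference::InferenceContext* c) {
        // The output has the same shape as the first input.
        c->set_output(0, c->input(0));
        return Status::OK();
    });

// Register the kernel for each device the functor is specialized for.
REGISTER_KERNEL_BUILDER(
    Name("MyOp").Device(DEVICE_CPU).TypeConstraint<float>("T"),
    MyOp<CPUDevice, float>);
REGISTER_KERNEL_BUILDER(
    Name("MyOp").Device(DEVICE_GPU).TypeConstraint<float>("T"),
    MyOp<GPUDevice, float>);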
edit
allocate_persistent: use this if you need your data between Op invocations, e.g. for one-time index structures. [example]
allocate_temp: just temporary memory, which will not be retained beyond the end of the Compute method. [example]
But I highly recommend reading the comments in the source code here and then deciding depending on your use case.
The best practice is to use the OpKernelContext::allocate_persistent() method to allocate memory, in the form of a tensorflow::Tensor, that outlives a single call to OpKernel::Compute(). It uses the appropriate Allocator* for the device, so if the kernel runs on a GPU device, it will allocate GPU memory for that particular device, and if it runs on a CPU device it will allocate CPU memory.
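For a rough illustration of that pattern, here is a minimal sketch using the (TF 1.x) PersistentTensor API; the op class name, dtype and shape are placeholders, and if the shape is only known at run time, allocate_persistent can instead be called on the OpKernelContext inside Compute().

class MyStatefulOp : public tensorflow::OpKernel {
public:
    explicit MyStatefulOp(tensorflow::OpKernelConstruction* ctx)
        : tensorflow::OpKernel(ctx) {
        // Allocated once with the device's allocator; reused across Compute() calls.
        OP_REQUIRES_OK(ctx, ctx->allocate_persistent(
                                tensorflow::DT_FLOAT,
                                tensorflow::TensorShape({1024}),
                                &scratch_, nullptr));
    }

    void Compute(tensorflow::OpKernelContext* ctx) override {
        // Re-acquire the underlying Tensor on each invocation.
        tensorflow::Tensor* scratch = scratch_.AccessTensor(ctx);
        // ... use scratch->flat<float>().data() in the kernel launch ...
    }

private:
    tensorflow::PersistentTensor scratch_;
};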

Designing an accumulating Tensorflow GPU operator

I'm designing a GPU op kernel that iteratively accumulates data in a buffer of GPU memory.
It's important that the data remains in GPU memory. So something along the lines of:
with tf.device('/gpu:0'):
    buffer = tf.zeros(...)

    buffer = accumulate(param11, param12, buffer)
    buffer = accumulate(param21, param22, buffer)
    buffer = accumulate(param31, param32, buffer)

with tf.device('/cpu:0'):
    A = do_some_more_stuff(buffer)
I'd like some input on three approaches that I think can be used to accomplish this:
Allocate an output tensor on each call and use that as the input tensor
on the next call. This is simple to implement, but I'm concerned that
continual allocation of GPU memory will be an issue.
Will TensorFlow release the now-unused allocations back into the GPU memory pool?
REGISTER_OP("Accumulate")
.Input("param1: T")
.Input("param2: T")
.Input("buffer_in: T")
.Output("buffer_out: T")
void Compute(tensorflow::OpKernelContext * ctx) override
{
TensorShape output_shape{...};
Tensor * output_ptr = nullptr;
OP_REQUIRES_OK(ctx, ctx->allocate_output(
0, output_shape, &output_ptr))
kernel<<<grid, blocks, 0, stream>>>(
ctx->input(0), ctx->input(1),
output);
}
Reference input and output tensors and ensure they're referring
to the same data. As I understand the standard ops and OpKernelContext
documentation, this needs to be protected with a mutex as other ops
may also be accessing the underlying referenced tensor...
REGISTER_OP("Accumulate")
.Input("param1: T")
.Input("param2: T")
.Input("buffer_in: Ref(T)")
.Output("buffer_out: Ref(T)")
void Compute(tensorflow::OpKernelContext * ctx) override
{
mutex_lock(mu_);
ctx->forward_ref_input_to_ref_output(2, 0);
kernel<<<grid, blocks, 0, stream>>>(
ctx->input(0), ctx->input(1),
ctx->mutable_input(2, true));
}
Use allocate_persistent() in conjunction with an OpKernelConstruction context
to provide a persistent buffer for accumulation. I'd prefer not to do this because
I'm dealing with variable buffer sizes and they'll probably be fairly large.
I'm not really sure what you're trying to do with your C++ code, but from looking at the python snippet I think tf.assign might help. It allows you to do things like this:
buffer = tf.Variable(...)
param = tf.Variable(...)
accumulate_op = buffer.assign(expr<param, buffer>)
...
sess.run(accumulate_op)
Running accumulate_op should update your buffer on the gpu (you may have to wrap it in a tf.group to avoid fetching the updated value).

How does one transfer CUDA constant memory in tensorflow's C++ API

Say I have a CUDA GPU kernel for a custom tensorflow op that uses constant memory:
__constant__ int cdata[100];

__global__ void frobulate(float * data)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    float value = data[i];

    for(int j=0; j < 100; ++j) {
        value += cdata[j];
    }

    // store the result (otherwise the loop has no observable effect)
    data[i] = value;
}
Then, when implementing the Compute method in my Frobulate custom op:
class Frobulate : public tensorflow::OpKernel
{
public:
    void Compute(OpKernelContext * context) override
    {
        ...
        // Get the current device
        const Eigen::GpuDevice & device = context->eigen_device<Eigen::GpuDevice>();

        // Local, mutating version of the constant data.
        // For illustration purposes only
        int local_data[100];

        // Reason about our local shape
        TensorShape local_shape({100});

        // Create a pointer to hold the allocated output
        Tensor * pinned_ary_ptr = nullptr;

        // Allocate memory for the constant data.
        // I don't think allocate_output is correct here...
        // but we need pinned host memory for an async transfer
        OP_REQUIRES_OK(context, context->allocate_output(
            0, local_shape, &pinned_ary_ptr));

        auto pinned_ary = pinned_ary_ptr->flat<int>();
        for(int i=0; i<100; ++i)
            { pinned_ary(i) = local_data[i]; }

        // Get the symbol address of cdata and enqueue an
        // async transfer on the device's stream
        int * d_cdata_ptr;
        cudaGetSymbolAddress((void **)&d_cdata_ptr, cdata);
        cudaMemcpyAsync(d_cdata_ptr, pinned_ary.data(), sizeof(int)*100,
            cudaMemcpyHostToDevice, device.stream());

        // Call the kernel
        frobulate<<<grid, blocks, 0, device.stream()>>>(data);
    }
};
Is this the right way to go about doing things? Ideally it would be good to make cdata an Input or Attr in my REGISTER_OP, but I don't think this will link up to the constant data correctly. I think the cudaGetSymbolAddress is necessary...
Is it safe? i.e. Will I interfere with tensorflow's GPU Stream Executor by enqueueing my own cuda commands and memcpys on the supplied stream?
Is context->allocate_output the correct method to call to get some pinned memory? Looking in the tensorflow codebase suggests that there are temp and scratch allocators, but I don't know if they're exposed to the user...
Edit 1: Does this allocate pinned memory? (memory usually allocated with cudaHostAlloc, whose pages are pinned for DMA transfers to the GPU, i.e. they're prevented from being swapped out by the OS).
tensorflow::AllocatorAttributes pinned_allocator;
pinned_allocator.set_on_host(true);
pinned_allocator.set_gpu_compatible(true);
// Allocate memory for the constant data
OP_REQUIRES_OK(context, context->allocate_temp(
DT_UINT8, cdata_shape, &cdata_tensor,
pinned_allocator));
Yes, the cudaGetSymbolAddress is necessary. Constant memory is specific to the kernel's compilation unit, so it cannot simply be wired up as a regular Input or Attr; you need the symbol address to copy into it.
It should not. Just make sure that the sequence of operations you enqueue on the stream is in the right order and synced up properly.
Yes, the output is the memory that the kernel will write to as the result of the operation. The scratch memory is mainly used for memory that you need only for a single invocation of the kernel. Some cuDNN kernels, like the convolution ones, use it. See tensorflow/core/kernels/conv_ops.cc.
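Putting those pieces together, here is a rough, unverified sketch of the flow discussed above: a pinned host staging tensor obtained via allocate_temp with the AllocatorAttributes from Edit 1, followed by an async copy into the __constant__ symbol on the op's stream. The shape, dtype, variable names and constant values are placeholders, and the staging tensor must outlive the asynchronous copy.

void Compute(tensorflow::OpKernelContext* context) override
{
    const Eigen::GpuDevice& device = context->eigen_device<Eigen::GpuDevice>();

    // Request pinned (page-locked) host memory so the H2D copy can be asynchronous.
    tensorflow::AllocatorAttributes pinned_alloc;
    pinned_alloc.set_on_host(true);
    pinned_alloc.set_gpu_compatible(true);

    tensorflow::Tensor staging;
    OP_REQUIRES_OK(context, context->allocate_temp(
        tensorflow::DT_INT32, tensorflow::TensorShape({100}),
        &staging, pinned_alloc));

    auto host = staging.flat<int>();
    for (int i = 0; i < 100; ++i) host(i) = i;  // placeholder constant data

    // Copy into the __constant__ symbol on the same stream the kernel uses,
    // so the copy is ordered before the subsequent kernel launch.
    int* d_cdata_ptr = nullptr;
    cudaGetSymbolAddress(reinterpret_cast<void**>(&d_cdata_ptr), cdata);
    cudaMemcpyAsync(d_cdata_ptr, host.data(), sizeof(int) * 100,
                    cudaMemcpyHostToDevice, device.stream());

    // ... then launch frobulate<<<grid, blocks, 0, device.stream()>>>(...) ...
}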