How does one transfer CUDA constant memory in tensorflow's C++ API

Say I have a CUDA GPU kernel for a custom tensorflow op that uses constant memory:
__constant__ int cdata[100];

__global__ void frobulate(float * data)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    float value = data[i];
    for(int j=0; j < 100; ++j) {
        value += cdata[j];
    }
    data[i] = value;
}
Then, when implementing the Compute method in my Frobulate custom op:
class Frobulate : public tensorflow::OpKernel
{
public:
    void Compute(OpKernelContext * context) override
    {
        ...
        // Get the current device
        const Eigen::GpuDevice & device = context->eigen_device<Eigen::GpuDevice>();
        // Local, mutating version of the constant data.
        // For illustration purposes only
        int local_data[100];
        // Reason about our local shape
        TensorShape local_shape({100});
        // Create a pointer to hold the allocated output
        Tensor * pinned_ary_ptr = nullptr;
        // Allocate memory for the constant data.
        // I don't think allocate_output is correct here...
        // but we need pinned host memory for an async transfer
        OP_REQUIRES_OK(context, context->allocate_output(
            0, local_shape, &pinned_ary_ptr));
        int * pinned_data = pinned_ary_ptr->flat<int>().data();
        for(int i=0; i<100; ++i)
            { pinned_data[i] = local_data[i]; }
        // Get the symbol address of cdata and enqueue an
        // async transfer on the device's stream
        int * d_cdata_ptr;
        cudaGetSymbolAddress((void **)&d_cdata_ptr, cdata);
        cudaMemcpyAsync(d_cdata_ptr, pinned_data, sizeof(int)*100,
            cudaMemcpyHostToDevice, device.stream());
        // Call the kernel (grid, blocks and data set up elsewhere)
        frobulate<<<grid, blocks, 0, device.stream()>>>(data);
    }
};
Is this the right way to go about doing things? i.e. Ideally it would be good to make cdata an Input or Attr in my REGISTER_OP, but I don't think this will link up to the constant data correctly. I think the cudaGetSymbolAddress is necessary...
Is it safe? i.e. Will I interfere with tensorflow's GPU Stream Executor by enqueueing my own cuda commands and memcpys on the supplied stream?
Is context->allocate_output the correct method to call to get some pinned memory? Looking in the tensorflow codebase suggests that there are temp and scratch allocators, but I don't know if they're exposed to the user...
Edit 1: Does this allocate pinned memory? (That is, memory usually allocated with cudaHostAlloc, whose pages are pinned for DMA transfers to the GPU, i.e. they're prevented from being swapped out by the OS.)
tensorflow::AllocatorAttributes pinned_allocator;
pinned_allocator.set_on_host(true);
pinned_allocator.set_gpu_compatible(true);

// Allocate memory for the constant data
tensorflow::Tensor cdata_tensor;
tensorflow::TensorShape cdata_shape({400});  // 100 ints as raw bytes, since DT_UINT8
OP_REQUIRES_OK(context, context->allocate_temp(
    DT_UINT8, cdata_shape, &cdata_tensor,
    pinned_allocator));

Yes, the cudaGetSymbolAddress is necessary. Constant memory is specific to the kernel and should not be accessed through an ordinary device pointer; the symbol address has to be looked up at runtime before you can copy to it.
It should not. Just make sure that the operations enqueued on your stream are in the right order and synced up properly.
Yes, output is the memory that the kernel will write as the result of the operation. The scratch memory is mainly used for memory that you need only for a single invocation of the kernel. Some cuDNN kernels, like the convolution ones, use it. See tensorflow/kernels/conv_ops.cc.
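Tying these answers together, here is a minimal sketch of the whole transfer. It uses DT_INT32 rather than Edit 1's DT_UINT8 so the element type matches cdata; device, local_data, grid, blocks and data are carried over from the question; and whether these allocator attributes really yield cudaHostAlloc-style pinned pages is exactly the open question above, so treat this as an illustration rather than a confirmed recipe:

// Sketch only: pinned host staging buffer plus an async copy to the
// __constant__ symbol, all ordered on the op's own stream so the
// copy lands before the kernel launch.
tensorflow::AllocatorAttributes pinned_alloc;
pinned_alloc.set_on_host(true);
pinned_alloc.set_gpu_compatible(true);

tensorflow::Tensor staging;
OP_REQUIRES_OK(context, context->allocate_temp(
    tensorflow::DT_INT32, tensorflow::TensorShape({100}),
    &staging, pinned_alloc));

int * host_ptr = staging.flat<tensorflow::int32>().data();
for(int i = 0; i < 100; ++i) { host_ptr[i] = local_data[i]; }

int * d_cdata_ptr;
cudaGetSymbolAddress((void **)&d_cdata_ptr, cdata);
cudaMemcpyAsync(d_cdata_ptr, host_ptr, sizeof(int)*100,
                cudaMemcpyHostToDevice, device.stream());
frobulate<<<grid, blocks, 0, device.stream()>>>(data);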

Related

fatfs f_write returns FR_DISK_ERR when passing a pointer to data in a mail queue

I'm trying to use FreeRTOS to write ADC data to an SD card on the STM32F7, and I'm using V1 of the CMSIS-RTOS API. I'm using mail queues, and I have a struct that holds an array.
typedef struct
{
    uint16_t data[2048];
} ADC_DATA;
On the ADC half/full transfer-complete interrupts, I add the data to the queue, and I have a consumer task that writes this data to the SD card. My issue is that in my consumer task I have to memcpy the data into another array and then write the contents of that array to the SD card.
void vConsumer(void const * argument)
{
    ADC_DATA *rx_data;
    osEvent writeEvent;
    for(;;)
    {
        writeEvent = osMailGet(adcDataMailId, osWaitForever);
        if(writeEvent.status == osEventMail)
        {
            // write data to SD
            rx_data = writeEvent.value.p;
            memcpy(sd_buff, rx_data->data, sizeof(sd_buff));
            if(wav_write_result == FR_OK)
            {
                if( f_write(&wavFile, (uint8_t *)sd_buff, SD_WRITE_BUF_SIZE, (void*)&bytes_written) == FR_OK)
                {
                    file_size += bytes_written;
                }
            }
            osMailFree(adcDataMailId, rx_data);
        }
    }
}
This works as intended but if I try to change this line to
if( f_write(&wavFile, (uint8_t *)rx_data->data, SD_WRITE_BUF_SIZE, (void*)&bytes_written) == FR_OK)
so as to get rid of the memcpy, f_write returns FR_DISK_ERR. Can anyone shine a light on why this happens? I feel like the extra memcpy is useless and you should just be able to pass the pointer from the queue straight to f_write.
So just a few thoughts here:
memcpy
Usually I copy only the necessary amount of data: if I have the size of the actual payload, I add a boundary check against the destination size and pass the smaller value to memcpy.
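For instance, a minimal sketch of such a guarded copy (payload_len is a hypothetical field carrying the number of valid bytes in the mail item):

// Sketch: clamp the copy to the destination's capacity so memcpy
// can never overrun sd_buff, and copy only the valid payload.
size_t n = payload_len;            // hypothetical: actual bytes in rx_data->data
if (n > sizeof(sd_buff)) {
    n = sizeof(sd_buff);
}
memcpy(sd_buff, rx_data->data, n);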
Your problem
I am just guessing here, but if you check the struct definition, the data field has the type uint16_t, and you cast it to a byte pointer. Also, the FatFs documentation expects a void* for the buf parameter, so the cast should not be necessary.
EDIT: Could you post more details of sd_buff?

buffers in CCL code samples along with the oneapi toolkit

I was going through the CCL code samples that come along with the oneAPI toolkit.
In the DPC++ (SYCL) code below, a buffer sendbuf is first created on the CPU side without being initialised. In the part where work is offloaded to the target device, the kernel-scope variable dev_acc_sbuf[id] is modified. This variable (dev_acc_sbuf) is not used again in the program, nor is its value explicitly copied back to sendbuf. Then, in the next line, sendbuf is used for the allreduce. I am not able to understand how changing dev_acc_sbuf makes a change in sendbuf.
cl::sycl::queue q;
cl::sycl::buffer<int, 1> sendbuf(COUNT);

/* open sendbuf and modify it on the target device side */
q.submit([&](cl::sycl::handler& cgh) {
    auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh);
    cgh.parallel_for<class allreduce_test_sbuf_modify>(range<1>{COUNT}, [=](item<1> id) {
        dev_acc_sbuf[id] += 1;
    });
});

/* invoke ccl_allreduce on the CPU side */
ccl_allreduce(&sendbuf,
              &recvbuf,
              COUNT,
              ccl_dtype_int,
              ccl_reduction_sum,
              NULL,
              NULL,
              stream,
              &request);
In the line auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh); the dev_acc_sbuf is a handle that accesses sendbuf, not a separate buffer. The changes made through the dev_acc_sbuf handle get reflected in the original buffer, i.e. sendbuf. This is an advantage in SYCL: changes made in kernel scope are automatically propagated back to the original buffer.
On most systems, the host and the device do not share physical memory: the CPU might use RAM, while the GPU might use its own global memory. SYCL needs to know which data it will be sharing between the host and the devices.
For this purpose, SYCL uses its buffers; the buffer class is generic over the element type and the number of dimensions. When passed a raw pointer, the buffer(T* ptr, range size) constructor takes ownership of the memory it has been passed. This means that we absolutely cannot use that memory ourselves while the buffer exists, which is why we begin a C++ scope. At the end of their scope, the buffers will be destroyed and the memory returned to the user. The size argument is a range object, which has to have the same number of dimensions as the buffer and is initialized with the number of elements in each dimension. In the example below, we have one dimension with ten elements.
Buffers are not associated with a particular queue or context, so they are capable of handling data transparently between multiple devices.
Accessors are used to request access to, and control over, the device memory held by buffer objects. Their access modes take care of the data movement between host and device, so we do not have to copy the result back from the device to the host explicitly.
Below is the example for more clarification:
#include <iostream>
#include <cstdlib>
#include <CL/sycl.hpp>
using namespace std;

class vector_addition;

int main(int, char**) {
    // creating host memory
    int *a = (int *)malloc(10*sizeof(int));
    int *b = (int *)malloc(10*sizeof(int));
    int *c = (int *)malloc(10*sizeof(int));
    for(int i=0; i<10; i++){
        a[i] = i;
        b[i] = 10-i;
    }

    cl::sycl::default_selector device_selector;
    cl::sycl::queue queue(device_selector);
    std::cout << "Running on "
              << queue.get_device().get_info<cl::sycl::info::device::name>()
              << "\n";
    {
        // creating buffers from pointers to host memory
        cl::sycl::buffer<int, 1> a_sycl{a, cl::sycl::range<1>{10}};
        cl::sycl::buffer<int, 1> b_sycl{b, cl::sycl::range<1>{10}};
        cl::sycl::buffer<int, 1> c_sycl{c, cl::sycl::range<1>{10}};

        queue.submit([&] (cl::sycl::handler& cgh) {
            // creating accessors for the buffers with the proper mode
            auto a_acc = a_sycl.get_access<cl::sycl::access::mode::read>(cgh);
            auto b_acc = b_sycl.get_access<cl::sycl::access::mode::read>(cgh);
            // the write accessor is responsible for copying back to host memory
            auto c_acc = c_sycl.get_access<cl::sycl::access::mode::write>(cgh);
            // kernel for execution
            cgh.parallel_for<class vector_addition>(cl::sycl::range<1>{10}, [=](cl::sycl::id<1> idx) {
                c_acc[idx] = a_acc[idx] + b_acc[idx];
            });
        });
    }   // buffers go out of scope here, so results are synced back to c

    for(int i=0; i<10; i++){
        cout << c[i] << " ";
    }
    cout << "\n";
    return 0;
}

When can I free resources and structures passed to a vulkan vkCreateXXX function?

I'm starting to learn Vulkan, and want to know whether the VkCreate[...] functions copy the resources pointed to by the provided structs into their own buffers.
To clarify my question, in this code I load a SPIR shader into my own mkShader struct and then I create the shadermodule with vkCreateShaderModule.
static VkShaderModule mkVulkanCreateShaderModule(MkVulkanContext *vc,
                                                 const char *filename)
{
    VkShaderModule shaderModule;
    struct mkShader *shader = mkVulkanLoadShaderBinary(filename);
    VkShaderModuleCreateInfo createInfo = {0};
    createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
    createInfo.codeSize = shader->size;
    createInfo.pCode = (uint32_t *)shader->buffer;
    if (vkCreateShaderModule(vc->device, &createInfo, NULL,
                             &shaderModule) != VK_SUCCESS) {
        printf(ANSI_COLOR_RED
               "Failed to create shader module\n" ANSI_COLOR_RESET);
        assert(0);
        exit(EXIT_FAILURE);
    }
    mkVulkanFreeShaderBinary(shader);
    return shaderModule;
}
As you can see, I'm freeing the mkShader struct with mkVulkanFreeShaderBinary after shader module creation, and I'm not receiving any errors from my program. So my question is whether this is safe to do, or whether I have to keep the mkShader struct until I destroy the shader module. And also, whether this holds for all VkCreate[...] functions or not, and whether this information is stated anywhere in the Vulkan spec.
See Object Lifetime of the Vulkan specification.
The ownership of application-owned memory is immediately acquired by any Vulkan command it is passed into. Ownership of such memory must be released back to the application at the end of the duration of the command, so that the application can alter or free this memory as soon as all the commands that acquired it have returned.
In other words, anything you allocate you are free to delete as soon as a Vulkan function call returns. Additionally, once you've created your pipeline, you're free to destroy the VkShaderModule too.
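To illustrate the last point, here is a sketch building on the question's helper (pipelineInfo stands for a fully populated VkGraphicsPipelineCreateInfo whose shader stage references the module; error handling is abbreviated):

VkShaderModule frag = mkVulkanCreateShaderModule(vc, "frag.spv");
/* ... fill pipelineInfo, whose shader stage points at frag ... */
VkPipeline pipeline;
if (vkCreateGraphicsPipelines(vc->device, VK_NULL_HANDLE, 1,
                              &pipelineInfo, NULL, &pipeline) != VK_SUCCESS) {
    exit(EXIT_FAILURE);
}
/* The pipeline no longer needs the module, so it can be destroyed
   here rather than at application shutdown. */
vkDestroyShaderModule(vc->device, frag, NULL);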

Tensorflow new Op CUDA kernel memory management

I have implemented a rather complex new Op in TensorFlow with a GPU CUDA kernel.
This Op requires a lot of dynamic memory allocation for variables which are not tensors and which are deallocated after the Op is done; more specifically, it involves using a hash table.
Right now I am using cudaMalloc() and cudaFree() but I have noticed Tensorflow has its own type called Eigen::GPUDevice which has the ability to allocate and deallocate memory on the GPU.
My questions:
Is it best practice to use Eigen::GPUDevice to manage GPU memory?
By using Eigen::GPUDevice instead of the CUDA API, am I "automatically" enabling multi-GPU support, since different GPUDevices can be passed to the Op?
Should I extend this idea to the CPU kernel and see if there is a CPUDevice type which also manages the memory, instead of using plain C++ (i.e. auto var = new int[100]; delete[] var)?
There is no direct public guideline for this issue. I usually just let TensorFlow allocate this memory:
template<typename Device, typename Dtype>
class MyOp: public OpKernel {
public:
    explicit MyOp(OpKernelConstruction *context) :
        OpKernel(context)
    {
        // ...
    }

    void Compute(OpKernelContext *context) override
    {
        Tensor tmp_var;
        Tensor* output = nullptr;
        TensorShape some_shape, some_shape2;

        // temporarily use this space
        OP_REQUIRES_OK(context, context->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
        // allocate memory for the output tensor
        OP_REQUIRES_OK(context, context->allocate_output(0, some_shape2, &output));
    }
};
Whatever needs memory should be allocated by the TensorFlow context and not by custom cudaMalloc or new type[num] calls.
The context provides the Allocator information for the device the kernel runs on.
See below.
Consider, for the sake of simplicity, just adding two matrices (full example).
TensorFlow operations usually contain the following structure:
Op description with REGISTER_OP, which is responsible for shape checking and setting the output shape (example)
OpKernel responsible for allocating memory, getting pointers to the inputs, and general setup (see above, or this)
Functor for the implementation itself, like:
Tensor* output = nullptr;
Tensor tmp_var;
OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
// the functor does not need to care about memory allocation,
// as everything is already set up at this point
::tensorflow::functor::MyFunctor<Device, Dtype>()(ctx, inputA, inputB, tmp_var, output);
You are then just left with implementing
// gpu version
template <typename Dtype>
struct MyFunctor<GPUDevice, Dtype> {
    void operator ()(::tensorflow::OpKernelContext* ctx, ...);
};

// cpu version
template <typename Dtype>
struct MyFunctor<CPUDevice, Dtype> {
    void operator ()(::tensorflow::OpKernelContext* ctx, ...);
};
edit
allocate_persistent: use this if you need your data between Op invocations, e.g. for one-time index structures. [example]
allocate_temp: just temporary memory, which will not be retained past the end of the Compute method. [example]
But I highly recommend reading the comments in the source code here and then deciding depending on your use case.
The best practice is to use the OpKernelContext::allocate_persistent() method to allocate memory, in the form of a tensorflow::Tensor, that outlives a single call to OpKernel::Compute(). It uses the appropriate Allocator* for the device, so if the kernel runs on a GPU device, it will allocate GPU memory for that particular device, and if it runs on a CPU device it will allocate CPU memory.
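As a minimal sketch of that pattern (the allocate_persistent signature below is from the TF 1.x OpKernel API, and PersistentTensor has since been deprecated, so check it against your TensorFlow version; the shape and names are made up for illustration):

// Sketch against the TF 1.x kernel API: a tensor that outlives a
// single Compute() call, allocated on whatever device the op runs on.
class MyStatefulOp : public tensorflow::OpKernel {
 public:
  explicit MyStatefulOp(tensorflow::OpKernelConstruction* context)
      : OpKernel(context) {}

  void Compute(tensorflow::OpKernelContext* context) override {
    if (!initialized_) {  // a real op would guard this with a mutex
      OP_REQUIRES_OK(context, context->allocate_persistent(
          tensorflow::DT_INT32, tensorflow::TensorShape({1024}),
          &table_, nullptr));
      initialized_ = true;
    }
    // AccessTensor returns the tensor backing the persistent handle.
    tensorflow::Tensor* table = table_.AccessTensor(context);
    // ... use table->flat<tensorflow::int32>() as hash-table storage ...
  }

 private:
  tensorflow::PersistentTensor table_;  // retained between invocations
  bool initialized_ = false;
};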

Write UART on PIC18

I need help with the UART communication I am trying to implement in my Proteus simulation. I use a PIC18F4520, and I want to display on the virtual terminal the values that have been calculated by the microcontroller.
Here is a snapshot of my design in Proteus.
Right now, this is what my UART code looks like:
#define _XTAL_FREQ 20000000
#define _BAUDRATE 9600

void Configuration_ISR(void) {
    IPR1bits.TMR1IP = 1;  // TMR1 Overflow Interrupt Priority - High
    PIE1bits.TMR1IE = 1;  // TMR1 Overflow Interrupt Enable
    PIR1bits.TMR1IF = 0;  // TMR1 Overflow Interrupt Flag
                          // 0 = TMR1 register did not overflow
                          // 1 = TMR1 register overflowed (must be cleared in software)
    RCONbits.IPEN = 1;    // Interrupt Priority High level
    INTCONbits.PEIE = 1;  // Enables all low-priority peripheral interrupts
    //INTCONbits.GIE = 1; // Enables all high-priority interrupts
}

void Configuration_UART(void) {
    TRISCbits.TRISC6 = 0;
    TRISCbits.TRISC7 = 1;
    SPBRG = ((_XTAL_FREQ/16)/_BAUDRATE)-1;
    // RCSTA REG
    RCSTAbits.SPEN = 1;   // enable serial port pins
    RCSTAbits.RX9 = 0;
    // TXSTA REG
    TXSTAbits.BRGH = 1;   // fast baud rate
    TXSTAbits.SYNC = 0;   // asynchronous
    TXSTAbits.TX9 = 0;    // 8-bit transmission
    TXSTAbits.TXEN = 1;   // enable transmitter
}

void WriteByte_UART(unsigned char ch) {
    while(!PIR1bits.TXIF); // Wait for the TXIF flag, which indicates
                           // that the TXREG register is empty
    TXREG = ch;            // Transmit data over UART
}

void WriteString_UART(char *data) {
    while(*data) {
        WriteByte_UART(*data++);
    }
}

unsigned char ReceiveByte_UART(void) {
    if(RCSTAbits.OERR) {   // clear an overrun error
        RCSTAbits.CREN = 0;
        RCSTAbits.CREN = 1;
    }
    while(!PIR1bits.RCIF); // Wait for a byte
    return RCREG;
}
And in the main loop:
while(1) {
    WriteByte_UART('a');               // This works. I can see the 'a's in the terminal
    WriteString_UART("Hello World !"); // Nothing displayed :(
} // end while(1)
I have tried different solutions for WriteString_UART, but none has worked so far.
I don't want to use printf because it impacts the other operations I'm doing with the PIC by adding delay.
So I really want to make it work with WriteString_UART.
In the end I would like to have something like "Error rate is : [a value]%" on the terminal.
Thanks for your help, and please tell me if something isn't clear.
In your WriteByte_UART() function, try polling the TRMT bit. In particular, change:
while(!PIR1bits.TXIF);
to
while(!TXSTA1bits.TRMT);
I don't know if this is your particular issue, but there is a race condition due to the fact that TXIF is not immediately cleared upon loading TXREG. Another option would be to try:
...
Nop();
while(!PIR1bits.TXIF);
...
EDIT BASED ON COMMENTS
The issue is due to the fact that the PIC18 uses two different pointer types, depending on whether data memory or program memory is addressed. Try changing your declaration to void WriteString_UART(const rom char * data) and see what happens. You will need to change your WriteByte_UART() declaration as well, to void WriteByte_UART(const unsigned char ch).
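For instance, a sketch of the adjusted declarations under C18's rom qualifier, with the TRMT polling from above folded in:

/* Sketch for C18: the rom qualifier tells the compiler that the
   pointer targets program memory, where string literals reside. */
void WriteByte_UART(const unsigned char ch) {
    while(!TXSTA1bits.TRMT); /* wait until the transmit shift register is empty */
    TXREG = ch;
}

void WriteString_UART(const rom char *data) {
    while(*data) {
        WriteByte_UART(*data++);
    }
}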
Add a delay of a few milliseconds after the line
TXREG = ch;
and verify that the pointer data passed to WriteString_UART(char *data) actually points to the string "Hello World !".
It seems you found a solution, but the reason why it wasn't working in the first place is still not clear. What compiler are you using?
I learned the hard way that C18 and XC8 handle memory spaces differently. With both compilers, a string declared literally, like char string[]="Hello!", will be stored in ROM (program memory). They differ in the way functions use strings.
C18 string functions have variants to access strings either in RAM or in ROM (for example strcpypgm2ram, strcpyram2pgm, etc.). XC8, on the other hand, does the job for you, and you do not need specific functions to choose which memory you want to access.
If you are using C18, I would highly recommend switching to XC8, which is more recent and easier to work with. If you still want to use C18, or another compiler which requires you to deal with program/data memory spaces, then here below are two solutions you may want to try. The C18 datasheet says that putsUSART prints a string from data memory to the USART, while putrsUSART prints a string from program memory. So you can simply use putrsUSART to print your string.
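For example, a one-line sketch (assuming the C18 USART peripheral library is set up):

/* putrsUSART reads the string directly from program memory,
   so no copy into RAM is needed. */
putrsUSART((const rom char *)"Error rate is : ");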
You may also want to try the following, which consists of copying your string from program memory to data memory (it may be a waste of memory if your application is tight on memory, though):
char pgmstring[] = "Hello";
char datstring[16];
strcpypgm2ram(datstring, pgmstring);
putsUSART(datstring);
In this example, pgmstring and datstring will be stored in data memory, while the string "Hello" will be stored in program memory. So even though pgmstring itself lives in data memory, it is initialized from an address in program memory (the address of "Hello"). The only way to get this string into data memory is to create a copy of it there, because a function that expects a string in data memory (such as putsUSART) can NOT be used directly with a string stored in program memory.
I hope this helps you understand a bit better how to work with Harvard microprocessors, where program and data memories are separated.