When can I free resources and structures passed to a Vulkan vkCreateXXX function?

I'm starting to learn Vulkan, and want to know whether the VkCreate[...] functions copy the resources pointed to in the create-info structs into their own buffers.
To clarify my question: in this code I load a SPIR-V shader into my own mkShader struct and then create the shader module with vkCreateShaderModule.
static VkShaderModule mkVulkanCreateShaderModule(MkVulkanContext *vc,
                                                 const char *filename)
{
    VkShaderModule shaderModule;
    struct mkShader *shader = mkVulkanLoadShaderBinary(filename);
    VkShaderModuleCreateInfo createInfo = {0};

    createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
    createInfo.codeSize = shader->size;
    createInfo.pCode = (uint32_t *)shader->buffer;

    if (vkCreateShaderModule(vc->device, &createInfo, NULL,
                             &shaderModule) != VK_SUCCESS) {
        printf(ANSI_COLOR_RED
               "Failed to create shader module\n" ANSI_COLOR_RESET);
        assert(0);
        exit(EXIT_FAILURE);
    }
    mkVulkanFreeShaderBinary(shader);
    return shaderModule;
}
As you can see, I'm freeing the mkShader struct with mkVulkanFreeShaderBinary after shader module creation, and I'm not getting any error from my program. So my question is whether this is safe to do, or whether I have to keep the mkShader struct alive until I destroy the shader module. Also, does this hold for all VkCreate[...] functions, and is this information anywhere in the Vulkan spec?

See the Object Lifetime section of the Vulkan specification.
The ownership of application-owned memory is immediately acquired by any Vulkan command it is passed into. Ownership of such memory must be released back to the application at the end of the duration of the command, so that the application can alter or free this memory as soon as all the commands that acquired it have returned.
In other words, anything you allocate you are free to delete as soon as a Vulkan function call returns. Additionally, once you've created your pipeline, you're free to destroy the VkShaderModule too.
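For example, here is a minimal sketch of both points, using a compute pipeline for brevity (layout is an assumed, already-created VkPipelineLayout; mkVulkanCreateShaderModule is the question's helper):

/* Sketch: both releases are legal under the lifetime rule quoted above. */
VkShaderModule module = mkVulkanCreateShaderModule(vc, "shader.comp.spv");
/* the SPIR-V blob was already freed inside that helper: safe, because
   vkCreateShaderModule had returned */

VkComputePipelineCreateInfo pipelineInfo = {0};
pipelineInfo.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
pipelineInfo.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
pipelineInfo.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
pipelineInfo.stage.module = module;
pipelineInfo.stage.pName = "main";
pipelineInfo.layout = layout; /* assumed: a VkPipelineLayout created elsewhere */

VkPipeline pipeline;
vkCreateComputePipelines(vc->device, VK_NULL_HANDLE, 1, &pipelineInfo,
                         NULL, &pipeline);

/* safe: a pipeline keeps no dependency on the shader module it was built from */
vkDestroyShaderModule(vc->device, module, NULL);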

Related

fatfs f_write returns FR_DISK_ERR when passing a pointer to data in a mail queue

I'm trying to use FreeRTOS to write ADC data to an SD card on the STM32F7, and I'm using V1 of the CMSIS-RTOS API. I'm using mail queues, and I have a struct that holds an array.
typedef struct
{
    uint16_t data[2048];
} ADC_DATA;
On the ADC half/full-complete interrupts, I add the data to the queue, and I have a consumer task that writes this data to the SD card. My issue is that in my consumer task I have to memcpy the data to another array and then write the contents of that array to the SD card.
void vConsumer(void const * argument)
{
    ADC_DATA *rx_data;
    for(;;)
    {
        writeEvent = osMailGet(adcDataMailId, osWaitForever);
        if(writeEvent.status == osEventMail)
        {
            // write data to SD
            rx_data = writeEvent.value.p;
            memcpy(sd_buff, rx_data->data, sizeof(sd_buff));
            if(wav_write_result == FR_OK)
            {
                if( f_write(&wavFile, (uint8_t *)sd_buff, SD_WRITE_BUF_SIZE, (void*)&bytes_written) == FR_OK)
                {
                    file_size += bytes_written;
                }
            }
            osMailFree(adcDataMailId, rx_data);
        }
    }
}
This works as intended, but if I change this line to
f_write(&wavFile, (uint8_t *)rx_data->data, SD_WRITE_BUF_SIZE, (void*)&bytes_written)
so as to get rid of the memcpy, f_write returns FR_DISK_ERR. Can anyone shine a light on why this happens? I feel like the extra memcpy is useless and you should just be able to pass the pointer from the queue straight to f_write.
So just a few thoughts here:
memcpy
Usually I copy only the necessary amount of data. If I have the size of the actual data, I add a bounds check and pass that size to memcpy, as in the sketch below.
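A minimal sketch of that pattern (actual_len here is hypothetical; the question's code only knows sizeof(sd_buff)):

/* Sketch: bound the copy by both the payload size and the destination capacity. */
size_t actual_len = sizeof(rx_data->data);  /* size of the real payload, if known */
size_t copy_len = (actual_len < sizeof(sd_buff)) ? actual_len : sizeof(sd_buff);
memcpy(sd_buff, rx_data->data, copy_len);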
Your problem
I am just guessing here, but if you check the struct definition, the data field has the type uint16_t and you cast it to a byte pointer. Also, the FatFs documentation expects a void* for the buf parameter, so the cast is unnecessary.
EDIT: Could you post more details of sd_buff?

buffers in CCL code samples along with the oneapi toolkit

I was going through the CCL code samples that ship with the oneAPI toolkit.
In the DPC++ (SYCL) code below, a buffer sendbuf is first created on the CPU side and is not initialised. In the part where work is offloaded to the target device, the dev_acc_sbuf[id] variable, which is in kernel scope, is modified. This variable (dev_acc_sbuf) is not used after that, nor is its value explicitly copied back to sendbuf. Then, in the next line, the sendbuf variable is used for the allreduce. I am not able to understand how changing dev_acc_sbuf changes sendbuf.
cl::sycl::queue q;
cl::sycl::buffer<int, 1> sendbuf(COUNT);

/* open sendbuf and modify it on the target device side */
q.submit([&](cl::sycl::handler& cgh) {
    auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh);
    cgh.parallel_for<class allreduce_test_sbuf_modify>(range<1>{COUNT}, [=](item<1> id) {
        dev_acc_sbuf[id] += 1;
    });
});

/* invoke ccl_allreduce on the CPU side */
ccl_allreduce(&sendbuf,
              &recvbuf,
              COUNT,
              ccl_dtype_int,
              ccl_reduction_sum,
              NULL,
              NULL,
              stream,
              &request);
In the line "auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh);", dev_acc_sbuf is a handle that accesses sendbuf, not a separate buffer. Changes made through the dev_acc_sbuf handle are reflected in the original buffer, i.e. sendbuf. This is an advantage of SYCL: changes made in kernel scope are automatically copied back to the original variable.
On most systems, the host and the device do not share physical memory: the CPU might use RAM while the GPU has its own global memory. SYCL needs to know which data it will be sharing between the host and the devices.
For this purpose, SYCL uses its buffers; the buffer class is generic over the element type and the number of dimensions. When passed a raw pointer, the buffer(T* ptr, range size) constructor takes ownership of the memory it has been passed. This means that we absolutely cannot use that memory ourselves while the buffer exists, which is why we begin a C++ scope. At the end of their scope, the buffers are destroyed and the memory returned to the user. The size argument is a range object, which has to have the same number of dimensions as the buffer and is initialized with the number of elements in each dimension. In the example below, we have one dimension with ten elements.
Buffers are not associated with a particular queue or context, so they are capable of handling data transparently between multiple devices.
Accessors are used to request access to, and control over, the device memory from the buffer objects. Their access modes take care of data movement between host and device, so we do not have to explicitly copy the result back from device to host.
Below is the example for more clarification:
#include <iostream>
#include <cstdlib>
#include <CL/sycl.hpp>
using namespace std;

class vector_addition;

int main(int, char**) {
    //creating host memory
    int *a = (int *)malloc(10*sizeof(int));
    int *b = (int *)malloc(10*sizeof(int));
    int *c = (int *)malloc(10*sizeof(int));
    for(int i = 0; i < 10; i++){
        a[i] = i;
        b[i] = 10-i;
    }

    cl::sycl::default_selector device_selector;
    cl::sycl::queue queue(device_selector);
    std::cout << "Running on "
              << queue.get_device().get_info<cl::sycl::info::device::name>()
              << "\n";
    {
        //creating buffers from pointers to host memory
        cl::sycl::buffer<int, 1> a_sycl{ a, cl::sycl::range<1>{10} };
        cl::sycl::buffer<int, 1> b_sycl{ b, cl::sycl::range<1>{10} };
        cl::sycl::buffer<int, 1> c_sycl{ c, cl::sycl::range<1>{10} };

        queue.submit([&] (cl::sycl::handler& cgh) {
            //creating accessors of the buffers with the proper modes
            auto a_acc = a_sycl.get_access<cl::sycl::access::mode::read>(cgh);
            auto b_acc = b_sycl.get_access<cl::sycl::access::mode::read>(cgh);
            //the write-mode accessor is responsible for copying back to host memory
            auto c_acc = c_sycl.get_access<cl::sycl::access::mode::write>(cgh);

            //kernel for execution
            cgh.parallel_for<class vector_addition>(cl::sycl::range<1>{10}, [=](cl::sycl::id<1> idx) {
                c_acc[idx] = a_acc[idx] + b_acc[idx];
            });
        });
    } //buffers go out of scope here; results are written back to the host pointers

    for(int i = 0; i < 10; i++){
        cout << c[i] << " ";
    }
    cout << "\n";

    free(a);
    free(b);
    free(c);
    return 0;
}

How does one transfer CUDA constant memory in tensorflow's C++ API

Say I have a CUDA GPU kernel for a custom tensorflow op that uses constant memory:
__constant__ int cdata[100];

__global__ void frobulate(float * data)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    float value = data[i];

    for(int j=0; j < 100; ++j) {
        value += cdata[j];  // accumulate the constant data
    }

    data[i] = value;        // write the result back
}
Then, when implementing the Compute method in my Frobulate custom op
class Frobulate : public tensorflow::OpKernel
{
public:
    void Compute(OpKernelContext * context) override
    {
        ...
        // Get the current device
        const Device & device = context->eigen_device<Eigen::GpuDevice>();

        // Local, mutating version of constant data.
        // For illustration purposes only
        int local_data[100];

        // Reason about our local shape
        TensorShape local_shape({100});

        // Create a pointer to hold allocated output
        Tensor * pinned_ary_ptr = nullptr;

        // Allocate memory for the complex_phase,
        // I don't think allocate_output is correct here...
        // but we need pinned host memory for an async transfer
        OP_REQUIRES_OK(context, context->allocate_output(
            0, local_shape, &pinned_ary_ptr));

        for(int i=0; i<100; ++i)
        { pinned_ary_ptr->flat<int>()(i) = local_data[i]; }

        // Get the symbol address of cdata and enqueue an
        // async transfer on the device's stream
        int * d_cdata_ptr;
        cudaGetSymbolAddress((void **)&d_cdata_ptr, cdata);
        cudaMemcpyAsync(d_cdata_ptr, pinned_ary_ptr->flat<int>().data(),
                        sizeof(int)*100, cudaMemcpyHostToDevice, device.stream());

        // Call the kernel
        frobulate<<<grid, blocks, 0, device.stream()>>>(data);
    }
};
1. Is this the right way to go about doing things? i.e. ideally it would be good to make cdata an Input or Attr in my REGISTER_OP, but I don't think this will link up to the constant data correctly. I think the cudaGetSymbolAddress is necessary...
2. Is it safe? i.e. will I interfere with tensorflow's GPU Stream Executor by enqueueing my own CUDA commands and memcpys on the supplied stream?
3. Is context->allocate_output the correct method to call to get some pinned memory? Looking in the tensorflow codebase suggests that there are temp and scratch allocators, but I don't know if they're exposed to the user...
Edit 1: Does this allocate pinned memory? (memory usually allocated with cudaHostAlloc, whose pages are pinned for DMA transfers to the GPU, i.e. they're prevented from being swapped out by the OS).
tensorflow::AllocatorAttributes pinned_allocator;
pinned_allocator.set_on_host(true);
pinned_allocator.set_gpu_compatible(true);

// Allocate memory for the constant data
OP_REQUIRES_OK(context, context->allocate_temp(
    DT_UINT8, cdata_shape, &cdata_tensor,
    pinned_allocator));
1. Yes, the cudaGetSymbolAddress is necessary. Constant memory is specific to the kernel; it is not memory that tensorflow manages for you, so you have to obtain the symbol's device address and copy to it yourself.
2. It should not. Just make sure that the sequence of operations in your stream execution is in the right order and synced up properly.
3. Yes, the output is the memory that the kernel will write as the result of the operation. The scratch memory is mainly used for memory that you need just for a single operation of the kernel. Some cudnn kernels, like the convolution ones, use it. See tensorflow/kernels/conv_ops.cc.
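To illustrate point 2 (a sketch, reusing the names from the question): work enqueued on a single CUDA stream executes in issue order, so the async copy is guaranteed to complete before a kernel launched behind it on the same stream starts reading cdata:

// Sketch: same-stream operations run in issue order; no extra sync is needed.
cudaMemcpyAsync(d_cdata_ptr, pinned_ary_ptr->flat<int>().data(), sizeof(int)*100,
                cudaMemcpyHostToDevice, device.stream());  // (1) enqueued first
frobulate<<<grid, blocks, 0, device.stream()>>>(data);     // (2) runs after (1)
// An event or cudaStreamSynchronize() is only needed across different streams.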

Is it possible to cast a managed bytes-array to a native struct without pin_ptr, so as not to bug the GC too much?

As far as I know, it is only possible to cast a managed array<Byte>^ to some non-managed struct by using pin_ptr, like:
void Example(array<Byte>^ bfr) {
    pin_ptr<Byte> ptr = &bfr[0];
    auto data = reinterpret_cast<NonManagedStruct*>(ptr);
    data->Header = 7;
    data->Length = sizeof(*data);  // size of the struct, not of the pointer
    data->CRC = CalculateCRC(data);
}
However, is this possible with interior_ptr in any way?
I'd rather work on managed data the low-level way (using unions, struct bit-fields, and so on) without pinning it - I could be holding this data for quite a long time and don't want to harass the GC.
Clarification:
I do not want to copy managed-data to native and back (so the Marshaling way is not an option here...)
You likely won't harass the GC with pin_ptr - it's pretty lightweight unlike GCHandle.
GCHandle::Alloc(someObject, GCHandleType::Pinned) will actually register the object as being pinned in the GC. This lets you pin an object for extended periods of time and across function calls, but the GC has to track that object.
On the other hand, pin_ptr gets translated to a pinned local in IL code. The GC isn't notified about it, but it will see that the object is pinned only during a collection. That is, it will notice its pinned status when looking for object references on the stack.
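For comparison, a minimal sketch of the GCHandle variant (the buffer and function name are illustrative, not from the question), where the GC explicitly tracks the pinned object:

// Sketch: explicit pinning that the GC must track until Free() is called.
using namespace System;
using namespace System::Runtime::InteropServices;

void PinForAWhile()
{
    array<Byte>^ bfr = gcnew array<Byte>(256);
    GCHandle handle = GCHandle::Alloc(bfr, GCHandleType::Pinned);
    try
    {
        Byte* p = static_cast<Byte*>(handle.AddrOfPinnedObject().ToPointer());
        p[0] = 7;  // the array cannot be relocated while the handle is alive
    }
    finally
    {
        handle.Free();  // unpin so the GC may move the array again
    }
}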
If you really want to, you can access stack memory in the following way:
[StructLayout(LayoutKind::Explicit, Size = 256)]
public value struct ManagedStruct
{
};

struct NativeStruct
{
    char data[256];
};

static void DoSomething()
{
    ManagedStruct managed;
    auto nativePtr = reinterpret_cast<NativeStruct*>(&managed);
    nativePtr->data[42] = 42;
}
There's no pinning at all here, but this is only due to the fact that the managed struct is stored on the stack, and therefore is not relocatable in the first place.
It's a convoluted example, because you could just write:
static void DoSomething()
{
    NativeStruct native;
    native.data[42] = 42;
}
...and the compiler would perform a similar trick under the covers for you.

How to get access to WriteableBitmap.PixelBuffer pixels with C++?

There are a lot of samples for C#, but only some code snippets for C++ on MSDN. I have put them together and I think it will work, but I am not sure if I am releasing all the COM references I have to.
Your code is correct--the reference count on the IBufferByteAccess interface of *buffer is incremented by the call to QueryInterface, and you must call Release once to release that reference.
However, if you use ComPtr<T>, this becomes much simpler: ComPtr<T> prevents you from calling the three members of IUnknown (AddRef, Release, and QueryInterface) directly. Instead, it encapsulates calls to these member functions in a way that makes it difficult to screw things up. Here's an example of how this would look:
// Get the buffer from the WriteableBitmap:
IBuffer^ buffer = bitmap->PixelBuffer;
// Convert from C++/CX to the ABI IInspectable*:
ComPtr<IInspectable> bufferInspectable(AsInspectable(buffer));
// Get the IBufferByteAccess interface:
ComPtr<IBufferByteAccess> bufferBytes;
ThrowIfFailed(bufferInspectable.As(&bufferBytes));
// Use it:
byte* pixels(nullptr);
ThrowIfFailed(bufferBytes->Buffer(&pixels));
The call to bufferInspectable.As(&bufferBytes) performs a safe QueryInterface: it computes the IID from the type of bufferBytes, performs the QueryInterface, and attaches the resulting pointer to bufferBytes. When bufferBytes goes out of scope, it will automatically call Release. The code has the same effect as yours, but without the error-prone explicit resource management.
The example uses the following two utilities, which help to keep the code clean:
auto AsInspectable(Object^ const object) -> Microsoft::WRL::ComPtr<IInspectable>
{
    return reinterpret_cast<IInspectable*>(object);
}

auto ThrowIfFailed(HRESULT const hr) -> void
{
    if (FAILED(hr))
        throw Platform::Exception::CreateException(hr);
}
Observant readers will notice that because this code uses a ComPtr for the IInspectable* we get from buffer, this code actually performs an additional AddRef/Release compared to the original code. I would argue that the chance of this impacting performance is minimal, and it's best to start from code that is easy to verify as correct, then optimize for performance once the hot spots are understood.
This is what I tried so far:
// Get the buffer from the WriteableBitmap
IBuffer^ buffer = bitmap->PixelBuffer;

// Get access to the base COM interface of the buffer (IUnknown)
IUnknown* pUnk = reinterpret_cast<IUnknown*>(buffer);

// Use IUnknown to get the IBufferByteAccess interface of the buffer
// to get access to the bytes. This requires #include <Robuffer.h>
IBufferByteAccess* pBufferByteAccess = nullptr;
HRESULT hr = pUnk->QueryInterface(IID_PPV_ARGS(&pBufferByteAccess));
if (FAILED(hr))
{
    throw Platform::Exception::CreateException(hr);
}

// Get the pointer to the bytes of the buffer
byte *pixels = nullptr;
pBufferByteAccess->Buffer(&pixels);

// *** Do the work on the bytes here ***

// Release the reference to IBufferByteAccess created by QueryInterface.
// Don't do this before you are done working with the pixel buffer;
// otherwise the buffer might get released or moved while you are still using it.
pBufferByteAccess->Release();
When using C++/WinRT (instead of C++/CX) there's a more convenient (and more dangerous) alternative. The language projection generates a data() helper function on the IBuffer interface that returns a uint8_t* into the memory buffer.
Assuming that bitmap is of type WriteableBitmap the code can be trimmed down to this:
uint8_t* pixels{ bitmap.PixelBuffer().data() };
// *** Do the work on the bytes here ***
// No cleanup required; it has already been dealt with inside data()'s implementation
In the code pixels is a raw pointer into data controlled by the bitmap instance. As such it is only valid as long as bitmap is alive, but there is nothing in the code that helps the compiler (or a reader) track that dependency.
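A minimal sketch of that dependency (the bitmap size is illustrative; assumes a UWP C++/WinRT project):

// Sketch: `pixels` is only valid while `bitmap` is alive.
{
    winrt::Windows::UI::Xaml::Media::Imaging::WriteableBitmap bitmap{ 640, 480 };
    uint8_t* pixels{ bitmap.PixelBuffer().data() };
    pixels[0] = 0xFF;  // fine: bitmap still owns the buffer here
}   // bitmap destroyed; dereferencing `pixels` past this point is undefined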
For reference, there's an example in the WriteableBitmap::PixelBuffer documentation illustrating the use of the (otherwise undocumented) helper function data().