How to get a Halide buffer with data on GPU?

I'm new to Halide. I have a pointer to data that is already on the GPU, and I want to get a Halide buffer from this pointer without copying the data. I have searched a lot and found this: /halidebuffer-on-gpu. It says that using Buffer::device_wrap_native will be helpful, and I have read the docs of Buffer::device_wrap_native, but I'm a little confused about what value I should pass for device_interface; the docs of device_interface don't help me much.

For device_interface you want to pass either halide_cuda_device_interface(), halide_opencl_device_interface(), or similar, depending on which GPU API you are using. These functions are all declared in the HalideRuntime*.h headers. Here's the full list, with a usage sketch after it:
HalideRuntimeCuda.h: halide_cuda_device_interface();
HalideRuntimeD3D12Compute.h: halide_d3d12compute_device_interface();
HalideRuntimeHexagonDma.h: halide_hexagon_dma_device_interface();
HalideRuntimeHexagonHost.h: halide_hexagon_device_interface();
HalideRuntimeMetal.h: halide_metal_device_interface();
HalideRuntimeOpenCL.h: halide_opencl_device_interface();
HalideRuntimeOpenGL.h: halide_opengl_device_interface();
HalideRuntimeOpenGLCompute.h: halide_openglcompute_device_interface();
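For example, wrapping an existing CUDA device pointer might look like this. This is a minimal sketch: dev_ptr, width, and height are illustrative names, assumed to describe an allocation made in the same CUDA context.

#include "HalideBuffer.h"
#include "HalideRuntimeCuda.h"

// Minimal sketch: wrap an existing CUDA device pointer in a
// Halide::Runtime::Buffer without copying any data.
Halide::Runtime::Buffer<float> wrap_cuda_pointer(uint64_t dev_ptr,
                                                 int width, int height) {
    // Shape the buffer without allocating any host storage.
    Halide::Runtime::Buffer<float> buf((float *)nullptr, width, height);
    // Attach the device pointer; no copy happens here.
    buf.device_wrap_native(halide_cuda_device_interface(), dev_ptr);
    buf.set_device_dirty();  // the freshest copy of the data is on the GPU
    return buf;
}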

How to add a new syntax element in HM (HEVC test Model)

I've been working on the HM reference software for a while to improve something in the intra prediction part. A new intra prediction algorithm has now been added to the code, and I let the encoder choose between my algorithm and the default algorithm of HM (according to the RDCost, of course).
What I need now, is to signal a flag for each PU, so that the decoder will be able to perform the same algorithm as the encoder decides in the rate distortion loop.
I want to know what exactly should I do to properly add this one bit flag to the stream, without breaking anything in the code.
Assuming that I want to use a CABAC context model to keep track of my flag's statistics, what else should I do besides:
adding a new context model like ContextModel3DBuffer m_cCUIntraAlgorithmSCModel to the TEncSbac.h file;
properly initializing the model (at both the encoder and decoder sides) by looking at how HM initializes other context models;
calling the functions m_pcBinIf->encodeBin(myFlag, cCUIntraAlgorithmSCModel) and m_pcTDecBinIf->decodeBin(myFlag, cCUIntraAlgorithmSCModel) at the encoder side and decoder side, respectively.
I have taken these three steps, but apparently they break something.
PS: Even equiprobable signaling (i.e. without using CABAC contexts) would be useful. I just want to send this flag peacefully!
Thanks in advance.
I finally managed to solve this problem. It was a bug in the CABAC context initialization.
But I want to share this experience, as many people may want to do the same thing.
The three steps that I explained are essentially what is necessary to add a new syntax element, but one must be very careful with the following:
In the beginning, you need to decide whether you want to use a separate context model for your syntax element or an existing one. If you separate the CABAC contexts, you should define a ContextModel3DBuffer, and the best way to do that is to find a similar syntax element in the code, then duplicate its ContextModel3DBuffer definition and ALL of its occurrences in the code. This way you can be sure you are considering everything.
Encoding of each syntax element happens in two different places: first in the RDO loop, to make a "decision", and second during the actual encoding phase, when the decisions are being encoded (e.g. in the encodeCtu function).
The order of encoding/decoding syntax elements must be the same at the encoder and decoder sides. For example, if your new syntax element is encoded after splitFlag and before predMode at the encoder side, you should decode it exactly between splitFlag and predMode at the decoder side.
The context model is implemented as a 3D matrix in order to track the statistics of syntax elements separately for different block sizes, components, etc. This means that when you call the function encodeBin, you must make sure that the correct index is being used. I made stupid mistakes in this part! See the sketch below.
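For illustration, a hedged sketch of the matching calls with an explicit 3D index; m_cCUIntraAlgorithmSCModel and uiCtxIdx are illustrative names, not part of stock HM:

// Encoder side (TEncSbac), inside the function that writes the flag:
m_pcBinIf->encodeBin( uiMyFlag, m_cCUIntraAlgorithmSCModel.get( 0, 0, uiCtxIdx ) );

// Decoder side (TDecSbac), at exactly the same point in the syntax order:
m_pcTDecBinIf->decodeBin( uiMyFlag, m_cCUIntraAlgorithmSCModel.get( 0, 0, uiCtxIdx ) );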
Apart from the above remarks, I found the function getState very useful for debugging. This function returns the state of your CABAC context model at any place in the code where you have access to it. It is very useful for comparing the state at the same point of the encoder and the decoder when you have a mismatch. For example, it happens a lot that you encode a 1 but decode a 0. In that case, you need to check the state of your CABAC context before encoding and before decoding; they should be the same. If they are not, track the error back to find the first place of mismatch.
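For example (again a hedged sketch with the same illustrative names), dumping the state at the matching encoder and decoder points lets you diff the two runs:

// Print the context state right before encoding/decoding the flag.
printf( "ctx state: %u\n", (unsigned) m_cCUIntraAlgorithmSCModel.get( 0, 0, uiCtxIdx ).getState() );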
I hope it was helpful.

Read XDATA value to app in CC2541

I want to read the MAC address of the Bluetooth LE chip CC2541, which is stored in memory location 0x780C. I went through the osal_snv_read function, but I don't know what osalSnvId_t id is. A brief explanation of how this works would be really helpful.
Apparently the location where the MAC address is stored cannot be read using osal_snv_read. So either I have to use
GAPRole_GetParameter(GAPROLE_BD_ADDR, ownAddress);
after the GAPROLE_STARTED event, or I have to use
__xdata __no_init uint8 mac_id[6] @ 0x780C;
Here __xdata says the variable lives in XDATA memory, and __no_init tells the compiler not to initialize it. Also, this declaration has to be kept outside any function to prevent it from being treated as an auto variable.
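Putting that together, a minimal sketch for the IAR 8051 toolchain (the helper name is illustrative; __xdata, __no_init, and the '@' placement operator are IAR-specific, and uint8 is assumed to come from the TI BLE stack headers):

// Map a variable over the fixed XDATA address holding the MAC address.
// File scope, so it is not an auto variable.
__xdata __no_init uint8 mac_id[6] @ 0x780C;

static void readMacAddress(uint8 *out)
{
    uint8 i;
    for (i = 0; i < 6; i++)
    {
        out[i] = mac_id[i];  // plain reads from XDATA; nothing is initialized
    }
}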
Credits: http://e2e.ti.com/support/low_power_rf/f/538/t/273968.aspx

Difference between memcpy_htod and to_gpu in Pycuda?

I am learning PyCUDA, and while going through the documentation on pycuda.gpuarray, I am puzzled by the difference between the pycuda.driver.memcpy_htod (also _dtoh) and pycuda.gpuarray.to_gpu (also get) functions. According to the gpuarray documentation, .get() will "transfer the contents of self into ary or a newly allocated numpy.ndarray. If ary is given, it must have the right size (not necessarily shape) and dtype. If it is not given, pagelocked specifies whether the new array is allocated page-locked."
Is this saying that .get() is implemented exactly the same way as pycuda.driver.memcpy_dtoh? Somehow, I think I am mis-interpreting it.
pycuda.gpuarray.GPUArray.get() copies the contents of the GPUArray back into a numpy.ndarray.
pycuda.driver.memcpy_dtoh() and friends copy plain buffers between CPU and GPU memory without any processing of the data in the buffers.
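A minimal sketch of the two approaches side by side (the array contents are illustrative):

import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

a = np.random.randn(4, 4).astype(np.float32)

# High level: gpuarray handles allocation and shape/dtype bookkeeping.
a_gpu = gpuarray.to_gpu(a)      # host -> device
b = a_gpu.get()                 # device -> host, returns a numpy.ndarray

# Low level: plain buffer copies; you manage allocation and metadata.
dev_buf = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(dev_buf, a)    # host -> device
c = np.empty_like(a)
cuda.memcpy_dtoh(c, dev_buf)    # device -> host

assert np.allclose(b, c)        # both round trips yield the same data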

Object Array to Byte Array - Marshal.AllocHGlobal Fragmentation query

I didn't think it fair to post a comment on Fredrik Mörk's answer in this 2-year-old post, so I thought I'd just ask it as a new question instead...
NB: This is not a criticism of the answer in any way; I'm simply trying to understand all of this before delving into memory management / the Marshal class.
In that answer, the GetByteArray function allocates memory for each object in the given array, inside a loop.
Would the GetByteArray function on the aforementioned post have benefited at all from allocating memory for the total size of the provided array:
Dim arrayBufferPtr = Marshal.AllocHGlobal(Marshal.SizeOf(<arrayElement>) * <array>.Count)
I just wonder whether allocating the memory as shown in the answer causes any kind of fragmentation, and, assuming there is fragmentation, whether the impact is big enough to be concerned about. Would allocating the memory in the way I've shown force you to call IntPtr.ToInt## to obtain pointer offsets from the overall allocation pointer, and therefore force you to check the underlying architecture to ensure the correct method is used*1 (ToInt32/ToInt64 depending on x86/x64), or is there a better way?
*1 I read elsewhere that calling the wrong IntPtr.ToInt## will cause overflow exceptions. What I mean by that statement is would I use:
Dim anOffsetPtr As New IntPtr(arrayBufferPtr.ToInt## + (loopIndex * <arrayElementSize>))
I've read through a few articles on the VB.Net Marshal class and memory allocation, listed below, but if you know of any other good articles I'm all ears!
http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.marshal.aspx
http://www.dotnetbips.com/articles/44bad06d-3662-41d3-b712-b45546cd8fa8.aspx
My favourite so far:
http://www.codeproject.com/KB/vb/Marshal.aspx
It is possible to allocate unmanaged memory for the whole array, and then copy each array element at an offset of SizeOf(arrayElement) * loopIndex. It is better to use the appropriate ToInt32/ToInt64 method according to the current platform, like:
Dim anOffsetPtr As IntPtr
If IntPtr.Size = 4 Then
    anOffsetPtr = New IntPtr(arrayBufferPtr.ToInt32() + (loopIndex * arrayElementSize))
Else
    anOffsetPtr = New IntPtr(arrayBufferPtr.ToInt64() + (loopIndex * arrayElementSize))
End If
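Putting the whole pattern together, a hedged sketch (MyStruct and items are illustrative names, and the structure is assumed to be blittable):

' Allocate one unmanaged block for the whole array, marshal each
' element at its offset, and free the block when done.
Dim elementSize As Integer = Marshal.SizeOf(GetType(MyStruct))
Dim arrayBufferPtr As IntPtr = Marshal.AllocHGlobal(elementSize * items.Count)
Try
    For loopIndex As Integer = 0 To items.Count - 1
        Dim anOffsetPtr As IntPtr
        If IntPtr.Size = 4 Then
            anOffsetPtr = New IntPtr(arrayBufferPtr.ToInt32() + (loopIndex * elementSize))
        Else
            anOffsetPtr = New IntPtr(arrayBufferPtr.ToInt64() + (loopIndex * elementSize))
        End If
        Marshal.StructureToPtr(items(loopIndex), anOffsetPtr, False)
    Next
    ' ... use the buffer here ...
Finally
    Marshal.FreeHGlobal(arrayBufferPtr)
End Try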

Calling functions from within function(float *VeryBigArray,long SizeofArray) from within objC method fails with EXC_BAD_ACCESS

OK, I finally found the problem. It was inside the C function (CarbonTuner2), not the ObjC method. Inside the function I was creating an array of the same size as the file, so if the file size was big it created a really big array, and my guess is that when I called another function from there, the local variables were put on the stack, which caused the EXC_BAD_ACCESS. What I did then was, instead of using a variable for the size of the array, put the number in directly. Then the code didn't even compile; it knew. The error was something like: "Array size too big". I guess working 20+ hours in a row isn't good XD. But I am definitely going to look into tools other than step-by-step debugging to figure these ones out. Thanks for your help. Here is the code; if you divide gFileByteCount by 2, you don't get the error anymore:
// ConverterController.h
#import <Cocoa/Cocoa.h>
#import "Converter.h"
@interface ConverterController : NSObject {
UInt64 gFileByteCount;
}
-(IBAction)ProcessFile:(id)sender;
void CarbonTuner2(long numSampsToProcess, long fftFrameSize, long osamp);
@end

// ConverterController.m
#include "ConverterController.h"
@implementation ConverterController
-(IBAction)ProcessFile:(id)sender{
UInt32 packets = gTotalPacketCount; // alloc a buffer of memory to hold the data read from disk.
gFileByteCount = 250000;
long LENGTH = (long)gFileByteCount;
CarbonTuner2(LENGTH, (long)8192/2, (long)4*2);
}
@end

void CarbonTuner2(long numSampsToProcess, long fftFrameSize, long osamp)
{
long numFrames = numSampsToProcess / fftFrameSize * osamp;
float g2DFFTworksp[numFrames+2][2 * fftFrameSize];
double hello = sin(2.345);
}
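As a hedged sketch of the fix (keeping the original names), allocating the workspace on the heap instead of declaring a huge variable-length array on the stack avoids overflowing the stack for large inputs:

#include <stdlib.h>  // for malloc/free

void CarbonTuner2(long numSampsToProcess, long fftFrameSize, long osamp)
{
    long numFrames = numSampsToProcess / fftFrameSize * osamp;
    // Heap allocation sized as (numFrames + 2) rows of 2*fftFrameSize floats.
    float *g2DFFTworksp = malloc((numFrames + 2) * 2 * fftFrameSize * sizeof(float));
    if (g2DFFTworksp == NULL) return;  // allocation failed
    // Index as g2DFFTworksp[row * 2 * fftFrameSize + col] instead of [row][col].
    free(g2DFFTworksp);
}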
Your crash has nothing to do with incompatibilities between C and ObjC.
And as previous posters said, you don't need to include math.h.
Run your code under gdb, and see where the crash happens by using backtrace.
Are you sure you're not sending bad arguments to the math functions?
E.g. this causes BAD_ACCESS:
double t = cos(*(double *)NULL);
Objective C is built directly on C, and the C underpinnings can and do work.
For an example of using math.h and parts of standard library from within an Objective C module, see:
http://en.wikibooks.org/wiki/Objective-C_Programming/syntax
There are other examples around.
Some care is needed when passing variables around: use the C variables for the C and standard library calls, and don't mix C data types and Objective-C data types incautiously. You'll usually want a conversion there.
If that is not the case, then please consider posting the code involved, and the error(s) you are receiving.
And with all respect due to Mr Hellman's response, I've hit errors when I don't have the header files included; I prefer to include the headers. But then, I tend to dial the compiler diagnostics up a couple of notches, too.
For what it's worth, I don't include math.h in my Cocoa app but have no problem using math functions (in C).
For example, I use atan() and don't get compiler errors, or run time errors.
Can you try this without including math.h at all?
First, you should add your code to your question, rather than posting it as an answer, so people can see what you're asking about. Second, you've got all sorts of weird problems with your memory management here - gFileByteCount is used to size a bunch of buffers, but it's set to zero, and doesn't appear to get re-set anywhere.
err = AudioFileReadPackets(fileID, false, &bytesReturned, NULL, 0,
                           &packets, (Byte *)rawAudio);
So, at this point, you pass a zero-sized buffer to AudioFileReadPackets, which promptly overruns the heap, corrupting the value of who knows what other variables...
fRawAudio = malloc(gFileByteCount/(BITS/8)*sizeof(fRawAudio));
Here's another, minor error - you want sizeof(*fRawAudio) here, since you're trying to allocate an array of floats, not an array of float pointers. Fortunately, those entities are the same size, so it doesn't matter.
You should probably start with some example code that you know works (SpeakHere?) and modify it. I suspect there are other similar problems in the code you posted, but I don't have time to find them right now. At the very least, get the rawAudio buffer appropriately sized, and use the values returned from AudioFileReadPackets appropriately.
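As a hedged sketch of that last point (fileID, gTotalPacketCount, and gFileByteCount are assumed to be set up as in the snippets above):

// Size the buffer from the file's byte count before reading, and use
// the counts AudioFileReadPackets hands back.
UInt32 bytesReturned = 0;
UInt32 packets = gTotalPacketCount;
Byte *rawAudio = malloc(gFileByteCount);  // must be non-zero sized
if (rawAudio != NULL) {
    OSStatus err = AudioFileReadPackets(fileID, false, &bytesReturned,
                                        NULL, 0, &packets, rawAudio);
    // On success, bytesReturned and packets reflect what was actually read.
}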