Difference between memcpy_htod and to_gpu in Pycuda? - numpy

I am learning PyCUDA, and while going through the documentation on pycuda.gpuarray, I am puzzled by the difference between pycuda.driver.memcpy_htod (also _dtoh) and pycuda.gpuarray.to_gpu (also get) functions. According to gpuarray documentation, .get().
For example,transfer the contents of self into array or a newly allocated numpy.ndarray. If array is given, it must have the right size (not necessarily shape) and dtype. If it is not given, a pagelocked specifies whether the new array is allocated page-locked.
Is this saying that .get() is implemented exactly the same way as pycuda.driver.memcpy_dtoh? Somehow, I think I am mis-interpreting it.

pycuda.gpuarray.GPUArray.get() stores the GPUArray as a numpy.ndarray.
pycuda.driver.memcpy_dtoh() and friends copy plain buffers between CPU and GPU memory without any processing of the data in the buffers.

Related

Does Pandas DataFrame constructor (and helper construction functions) release the GIL while copying the source array?

Preamble
On another question I understood that python constructors routines does not make a copy of the provided numpy array only if the data-type of the array is the same for all the entries. In case the constructor is fed with a structured numpy array with different types on columns, it makes a copy.
Implementation reference
df_dict = {}
for i in range(5):
obj = Object(1000000)
arr = obj.getNpArr()
print(arr[:10])
df_dict[i] = pandas.DataFrame.from_records(arr)
print("The DataFrames are :")
for i in range(5):
print(df_dict[i].head(10))
In this case Object(N) constructs an instance of Object which internally allocates and initializes a 2D array of shape (N,3), with dtypes 'f8','i4','i4' on each row. Object manages the life of these data, deallocating it on destruction. The function Object.getNpArr() returns a np.recarray pointing to the internal data and it has the above mentioned dtype. Importantly, the returned array does not own the data, it is just a view.
Problem
The DataFrame printed at the end show corrupted data (with respect to the printed array inside the first loop). I am not expecting such behaviour, since the array fed to the pandas construction function is copied (I separately checked this behaviour).
I have not many ideas about the cause and solutions to avoid data corruption. The only guess I can make is:
the constructor starts allocating the memory for its own data, which takes long because of the big size, and then copies
before/during the allocation/copy, the GIL is released and it is taken back to the for loop
for loop proceed before the copy of the array is completed, going to the next iteration
at the next iteration the obj name is moved to the new Object and the memory is deallocated, which causes data corruption in the copy of the DataFrame at the previous iteration, which is probably still running.
If this is really the cause of the issue, how can I find a workaround? Is there a way to let the GIL go through only when the copy of the array is effectively done?
Or, if my guess is wrong, what is the cause of the data corruption?

How to copy SPIR-V Private storage-qualified matrix array into Function storage-qualified matrix array?

I am fighting with driver bug, and long story short i have created OpVariables with proper data in invocation-level global memory, aka Private storage-qualified OpVariable of type float4x3[6].
Now, i need this data converted to Function storage qualifier, as OpVariables in OpFunction scope. But i am kind of lost about when do i apply what copying operations, especially with matrices and arrays, and I have both at once. Do i just OpLoad and OpStore? Or do i need to OpLoad, OpCompositeExtract each matrix from indices, OpCompositeConstruct from them and only afterwards OpStore? The SPIR-V specification on the subject is rather dense, and i cannot seem to find one place where copying operations are described. The rules are probably scattered all over the spec.
After some experimentation it seems that all that is required to convert variables from Private to Function is simple OpLoad(Private) and OpStore to Function variable.

How to get a halide buffer with data on GPU?

I'm new to halide. Now I have a pointer which points to data on GPU. I want to get a halide buffer from this pointer without copying data. I have searched a lot and found this /halidebuffer-on-gpu . It says using Buffer::device_wrap_native will be helpful. And I have read the docs of itBuffer::device_wrap_nativeBut I'm little confused about what value should I pass to device_interface? docs of device_interface don't help me much.
For device_interface you want to pass either halide_cuda_device_interface(), or halide_opencl_device_interface(), or similar. These methods are all defined in HalideRuntime*.h. Here's the full list:
HalideRuntimeCuda.h: halide_cuda_device_interface();
HalideRuntimeD3D12Compute.h: halide_d3d12compute_device_interface();
HalideRuntimeHexagonDma.h: halide_hexagon_dma_device_interface();
HalideRuntimeHexagonHost.h: halide_hexagon_device_interface();
HalideRuntimeMetal.h: halide_metal_device_interface();
HalideRuntimeOpenCL.h: halide_opencl_device_interface();
HalideRuntimeOpenGL.h: halide_opengl_device_interface();
HalideRuntimeOpenGLCompute.h: halide_openglcompute_device_interface();

Object Array to Byte Array - Marshal.AllocHGlobal Fragmentation query

I didn't think it fair to post a comment on Fredrik Mörk's answer in this 2 year old post, so I thought I'd just ask it as a new question instead...
NB: This is not a critiscm of the answer in any way, I'm simply trying to understand this all before delving into memory management / the marshal class.
In that answer, the function GetByteArray allocates memory to each object within the given array, within a loop.
Would the GetByteArray function on the aforementioned post have benefited at all from allocating memory for the total size of the provided array:
Dim arrayBufferPtr = Marshal.AllocHGlobal(Marshal.SizeOf(<arrayElement>) * <array>.Count)
I just wonder if allocating the memory, as shown in the answer, causes any kind of fragmentation? Assuming there may be fragmentation, would there be much of an impact to be concerned with? Would allocating the memory in the way I've shown force you to call IntPtr.ToInt## to obtain pointer offsets from the overall allocation pointer, and therefore force you to check the underlying architecture to ensure the correct method is used*1 or is there a better way? (ToInt32/ToInt64 depending on x86/64?)
*1 I read elsewhere that calling the wrong IntPtr.ToInt## will cause overflow exceptions. What I mean by that statement is would I use:
Dim anOffsetPtr As New IntPtr(arrayBufferPtr.ToInt## + (loopIndex * <arrayElementSize>))
I've read through a few articles on the VB.Net Marshal class and memory allocation; listed below, but if you know fo any other good articles I'm all ears!
http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.marshal.aspx
http://www.dotnetbips.com/articles/44bad06d-3662-41d3-b712-b45546cd8fa8.aspx
My favourite so far:
http://www.codeproject.com/KB/vb/Marshal.aspx
It is possible to allocate unmanaged memory for the whole array, and then copy every array element with SizeOf(arrayElement)*loopIndex offset. It is better to use appropriate ToInt32/ToInt64 method, according to the current platform, like:
Dim anOffsetPtr
if arrayBufferPtr.Size = 4 then
anOffsetPtr = New IntPtr(arrayBufferPtr.ToInt32() + (loopIndex * arrayElementSize))
else
anOffsetPtr = New IntPtr(arrayBufferPtr.ToInt64() + (loopIndex * arrayElementSize))
endif

How do you pre-size an array in Lua?

I've got a Lua program that seems to be slower than it ought to be. I suspect the issue is that I'm adding values to an associative array one at a time and the table has to allocate new memory each time.
There did seem to be a table.setn function, but it fails under Lua 5.1.3:
stdin:1: 'setn' is obsolete
stack traceback:
[C]: in function 'setn'
stdin:1: in main chunk
[C]: ?
I gather from the Google searching I've done that this function was depreciated in Lua 5.1, but I can't find what (if anything) replaced the functionality.
Do you know how to pre-size a table in Lua?
Alternatively, is there some other way to avoid memory allocation when you add an object to a table?
Let me focus more on your question:
adding values to an associative array
one at a time
Tables in Lua are associative, but using them in an array form (1..N) is optimized. They have double faces, internally.
So.. If you indeed are adding values associatively, follow the rules above.
If you are using indices 1..N, you can force a one-time size readjust by setting t[100000]= something. This should work until the limit of optimized array size, specified within Lua sources (2^26 = 67108864). After that, everything is associative.
p.s. The old 'setn' method handled the array part only, so it's no use for associative usage (ignore those answers).
p.p.s. Have you studied general tips for keeping Lua performance high? i.e. know table creation and rather reuse a table than create a new one, use of 'local print=print' and such to avoid global accesses.
static int new_sized_table( lua_State *L )
{
int asize = lua_tointeger( L, 1 );
int hsize = lua_tointeger( L, 2 );
lua_createtable( L, asize, hsize );
return( 1 );
}
...
lua_pushcfunction( L, new_sized_table );
lua_setglobal( L, "sized_table" );
Then, in Lua,
array = function(size) return sized_table(size,0) end
a = array(10)
As a quick hack to get this running you can add the C to lua.c.
I don't think you can - it's not an array, it's an associative array, like a perl hash or an awk array.
http://www.lua.org/manual/5.1/manual.html#2.5.5
I don't think you can preset its size meaningfully from the Lua side.
If you're allocating the array on the C side, though, the
void lua_createtable (lua_State *L, int narr, int nrec);
may be what you need.
Creates a new empty table and pushes
it onto the stack. The new table has
space pre-allocated for narr array
elements and nrec non-array elements.
This pre-allocation is useful when you
know exactly how many elements the
table will have. Otherwise you can use
the function lua_newtable.
There is still an internal luaL_setn and you can compile Lua so that
it is exposed as table.setn. But it looks like that it won't help
because the code doesn't seem to do any pre-extending.
(Also the setn as commented above the setn is related to the array part
of a Lua table, and you said that your are using the table as an associative
array)
The good part is that even if you add the elements one by one, Lua does not
increase the array that way. Instead it uses a more reasonable strategy. You still
get multiple allocations for a larger array but the performance is better than
getting a new allocation each time.
Although this doesn't answer your main question, it answers your second question:
Alternatively, is there some other way to avoid memory allocation when you add an object to a table?
If your running Lua in a custom application, as I can guess since your doing C coding, I suggest you replace the allocator with Loki's small value allocator, it reduced my memory allocations 100+ fold. This improved performance by avoiding round trips to the Kernel, and made me a much happier programmer :)
Anyways I tried other allocators, but they were more general, and provide guarantee's that don't benefit Lua applications (such as thread safety, and large object allocation, etc...), also writing your own small-object allocator can be a good week of programming and debugging to get just right, and after searching for an available solution Loki's allocator wasthe easiest and fastest I found for this problem.
If you declare your table in code with a specific amount of items, like so:
local tab = { 0, 1, 2, 3, 4, 5, ... , n }
then Lua will create the table with memory already allocated for at least n items.
However, Lua uses the 2x incremental memory allocation technique, so adding an item to a table should rarely force a reallocation.