I have a program in VB.net that uses a 3D array:
Private gridList(10, 900, 900) As GridElement
Now, I just used a Memory Profiler on it (because my application is having some major leak issues or something) and apparently, this array (containing at the moment of testing 0-30 elements at one time) is using 94% of the memory currently in use by my application. Even when it is empty it takes up huge amounts of memory.
My only assumption is that even empty arrays take up space! This puts a major blow into my plans!
My Question:
Is there any alternative to this that allows me to still have the same abilities to map
e.g. I've been using it like this:
Dim cGE as GridElement = gridList(3, 5, 7)
but doesn't hog up so much memory for things that aren't using memory?
Thanks!
Do Arrays take up space even without values in them in .net?
No. But your array has values in it, and hence takes up space. Declaring gridList(10, 900, 900) allocates all 11 × 901 × 901 = 8,929,811 slots up front (VB.NET array bounds are inclusive), whether or not you ever assign anything to them.
To avoid keeping a lot of elements in memory when you only access a few of all the possible elements, you need to use a so-called sparse array. In .NET, this is most easily implemented via a Dictionary, where the key in your case would be a three-element structure*, and the value would be a GridElement.
* If you’re using an up-to-date version of .NET, then you can model this via a Tuple(Of Integer, Integer, Integer)
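The sparse-array idea can be sketched outside .NET as well. The following is a minimal C++ illustration, not the answer's own code: `SparseGrid`, its methods, and the one-field `GridElement` are hypothetical names chosen for the example, and an ordered `std::map` keyed by a three-element tuple stands in for the Dictionary.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>

// Hypothetical stand-in for the question's GridElement type.
struct GridElement {
    int value;
};

// Sparse 3D grid: only cells that were explicitly set occupy memory.
class SparseGrid {
public:
    using Key = std::tuple<int, int, int>;

    void set(int x, int y, int z, const GridElement& element) {
        cells_[Key{x, y, z}] = element;
    }

    // Returns a pointer to the stored element, or nullptr for an empty cell.
    const GridElement* find(int x, int y, int z) const {
        auto it = cells_.find(Key{x, y, z});
        return it == cells_.end() ? nullptr : &it->second;
    }

    // Like find(), but the caller promises that the cell exists.
    const GridElement& at(int x, int y, int z) const {
        const GridElement* element = find(x, y, z);
        assert(element != nullptr && "cell is empty");
        return *element;
    }

    std::size_t size() const { return cells_.size(); }

private:
    std::map<Key, GridElement> cells_;  // keyed by (x, y, z)
};
```

With this shape, a lookup such as `grid.find(3, 5, 7)` replaces `gridList(3, 5, 7)`, and memory use scales with the 0-30 live elements rather than with the full index range.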
A C++ standard library implements std::copy (ignoring all sorts of wrappers, concept checks, etc.) with this simple loop:
for (; __first != __last; ++__result, ++__first)
*__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is 1 or 2, and we have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that.
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements up to the first multiple of 32B, and the last elements (after the last such multiple), differently.
Slack due to output address mis-alignment - The output is written in transactions of up to 128B (or is it just 32B?); we thus have to handle the first elements up to the first multiple of this number, and the last elements (after the last such multiple), differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
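To make the "slack" factors concrete, here is a small host-side C++ sketch (not device code) of how a copy might be split into an unaligned head, a transaction-aligned body, and a tail. `CopySplit` and `split_for_alignment` are hypothetical names, and the sketch assumes that sizeof(T) divides the transaction size and that the pointer is aligned for T.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Element counts for the three phases of an alignment-aware copy.
struct CopySplit {
    std::size_t head;  // elements before the first transaction boundary
    std::size_t body;  // elements covered by whole transactions
    std::size_t tail;  // leftover elements after the last whole transaction
};

template <typename T>
CopySplit split_for_alignment(const T* ptr, std::size_t length,
                              std::size_t transaction_bytes)
{
    auto address = reinterpret_cast<std::uintptr_t>(ptr);
    // Assumptions of this sketch, checked in debug builds.
    assert(transaction_bytes % sizeof(T) == 0);
    assert(address % sizeof(T) == 0);

    std::size_t misalignment = address % transaction_bytes;
    std::size_t head_bytes =
        (misalignment == 0) ? 0 : transaction_bytes - misalignment;
    std::size_t head = head_bytes / sizeof(T);
    if (head > length) head = length;  // the whole copy fits in the head

    std::size_t per_transaction = transaction_bytes / sizeof(T);
    std::size_t remaining = length - head;
    std::size_t body = remaining - remaining % per_transaction;
    return CopySplit{head, body, remaining - body};
}
```

For example, 100 two-byte elements starting 4 bytes past a 32-byte boundary split into 14 head elements, 80 body elements, and 6 tail elements. A warp-level copy would presumably handle head and tail with predicated single-element accesses and the body with wide, coalesced ones.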
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode
I am trying to create city distances array CityDistances(CityId1, CityId2) = Distance
As I don't know how many cities there will be, I need an unlimited array. I tried creating it via Dim CityDistances(,) As Double, but when I try to use it, it throws an exception. How do I achieve that?
Instead of an array, a possibly better alternative (and a little more OOP) is through the use of a List(Of Type).
Dim d As List(Of Distances) = New List(Of Distances)()
d.Add(New Distances() With {.CityID1=1, .CityID2=2, .Distance=12.4})
d.Add(New Distances() With {.CityID1=1, .CityID2=3, .Distance=15.1})
d.Add(New Distances() With {.CityID1=1, .CityID2=4, .Distance=2.4})
d.Add(New Distances() With {.CityID1=1, .CityID2=5, .Distance=130.1})
Public Class Distances
Public CityID1 as Integer
Public CityID2 as Integer
Public Distance as Double
End Class
This has the advantage of letting your list grow without specifying any initial limits.
First of all, by doing this:
Dim CityDistances(,) As Double
You declare a reference to a two-dimensional array, without performing any memory allocation for its elements. It is expected that some method will return this array, and you will use it via this reference. If you try to use it as is, without assigning anything to it, the variable is still Nothing and you will get a NullReferenceException, which is normal.
Second, there is no such thing as an unlimited array. You need to use a list of lists, a dictionary of dictionaries, a DataTable or similar if you want automatic memory management. If you want to stick with arrays, for performance/convenience reasons, you can ReDim the array when required (increase dimensions) and preserve its contents, like this:
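Dim CityDistances(,) As Double
ReDim CityDistances(10, 10)
'.... code goes here ....
' ReDim Preserve may only change the last dimension of a multidimensional array
ReDim Preserve CityDistances(10, 100)
Make sure you know when to ReDim, because every time you do it, .NET creates another array and copies all elements into it. As your array grows, this can become a performance factor. Note also that ReDim Preserve can resize only the last dimension; to grow both dimensions while keeping the contents, allocate a new array and copy the elements yourself.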
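The dictionary-based alternative mentioned above can be sketched as follows. This is a C++ illustration rather than VB.NET; the class name `CityDistances`, its methods, and the assumption that distances are symmetric (so (1, 2) and (2, 1) share one entry) are all choices made for the example.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Distance table that grows as pairs are added; no size is fixed up front.
class CityDistances {
public:
    void set(int city1, int city2, double distance) {
        assert(distance >= 0.0);
        distances_[key(city1, city2)] = distance;
    }

    // Writes the distance and returns true if the pair is known.
    bool try_get(int city1, int city2, double& distance) const {
        auto it = distances_.find(key(city1, city2));
        if (it == distances_.end()) return false;
        distance = it->second;
        return true;
    }

private:
    // Assumption: distances are symmetric, so the smaller id always goes first.
    static std::pair<int, int> key(int a, int b) {
        return a < b ? std::make_pair(a, b) : std::make_pair(b, a);
    }

    std::map<std::pair<int, int>, double> distances_;
};
```

Unlike the ReDim approach, adding a city never copies existing entries, and pairs that were never set cost no memory.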
Third, based on the nature of your question, you may want to look into custom implementations of the Matrix class. Here are some links I found through Google, hope you find them useful. Those are C#, but there are free online converters to VB.NET on the internet.
Lightweight fast matrix class in C# (Strassen algorithm, LU decomposition) (free)
Efficient Matrix Programming in C# (code-project, so free as well)
High performance matrix algebra for .NET programming! (paid, 99$ and up)
I am using Cocoa/Objective-C and I am using NSBitmapImageRep getPixel:atX:y: to test whether R is 0 or 255. That is the only piece of data I need (the bitmap is only black and white).
I am noticing that this one function is the biggest draw on CPU power in my application, accounting for something like 95% of the overhead. Would it be faster for me to preload the bitmap into a 2 dimensional integer array
NSUInteger pixels[1280][1024];
and read the values like so:
if(pixels[x][y]!=0){
//....do stuff
}
?
One thing that might be helpful could be converting the data into something more "dense". Since you're only interested in a single bit per pixel location, it doesn't make sense to store more than that. Storing more data than necessary means you get less usage out of your cache, which can really slow things down if the image is big and/or the accesses very random.
For instance, you could use the platform's largest "native" integer and pack in the pixels so that each pixel uses a single bit. That will make the access a bit more involved since you need to do a single-bit test, but it might be a win.
You would do something like this:
uint32_t image[HEIGHT * ((WIDTH + 31) / 32)];
Then initialize this array by using the slow getter method, once per pixel. Then you can read out the value of a pixel using something like image[y * ((WIDTH + 31) / 32) + (x / 32)] & (1u << (x & 31)). (Note the unsigned 1u: shifting a signed 1 left by 31 bits is not well-defined.)
I'm being vague ("might", "can" and so on) since it really depends on your access pattern, the size of the image, and other things. You should probably test it.
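As a rough illustration of the packing described above, here is a self-contained C++ sketch. `BitImage` is a hypothetical helper, not part of Cocoa; it uses the same row layout, with (width + 31) / 32 words per row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per pixel, packed into 32-bit words, row by row.
class BitImage {
public:
    BitImage(std::size_t width, std::size_t height)
        : width_(width), height_(height),
          words_per_row_((width + 31) / 32),
          words_(words_per_row_ * height, 0u) {}

    void set(std::size_t x, std::size_t y, bool on) {
        assert(x < width_ && y < height_);
        std::uint32_t mask = std::uint32_t{1} << (x & 31);
        std::uint32_t& word = words_[y * words_per_row_ + x / 32];
        if (on) word |= mask; else word &= ~mask;
    }

    bool get(std::size_t x, std::size_t y) const {
        assert(x < width_ && y < height_);
        return ((words_[y * words_per_row_ + x / 32] >> (x & 31)) & 1u) != 0;
    }

private:
    std::size_t width_;
    std::size_t height_;
    std::size_t words_per_row_;
    std::vector<std::uint32_t> words_;
};
```

A 1280 × 1024 image takes 160 KiB in this form, which is far more cache-friendly than one NSUInteger per pixel. You would fill it once with the slow getter and then query it with `get(x, y)`.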
I'm not familiar with Objective-C or the NSBitmapImageRep object, but a reasonable guess is that the getPixel routine employs clipping to avoid reading outside of memory, which could be a possible slowdown (among other things).
Have a look inside it and see what it does.
(update)
Having learnt that this is Apple code, you probably can't take a look inside it.
However, the documentation for NSBitmapImageRep_Class seems to indicate that getPixel:atX:y: performs at least some type magic. You could test if the result is clipped by accessing a pixel outside of the image boundary and observing the result.
The bitmapData seems to be something you'd be interested in: get the pointer to the data, then read the array yourself avoiding type conversion or clipping.
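Reading the raw buffer yourself comes down to simple index arithmetic. The C++ sketch below works on a plain byte pointer and makes several assumptions that you would need to check against the actual bitmap: 8 bits per sample, interleaved (non-planar) samples with red first, and a row stride given in bytes. `red_at` and its parameters are hypothetical names; the stride and samples-per-pixel values would come from the image rep's bytesPerRow and samplesPerPixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns the first (red) sample of pixel (x, y) from a raw pixel buffer.
// Assumes 8-bit, interleaved samples; rows may be padded to bytes_per_row.
std::uint8_t red_at(const std::uint8_t* data,
                    std::size_t bytes_per_row,
                    std::size_t samples_per_pixel,
                    std::size_t x, std::size_t y)
{
    assert(data != nullptr && samples_per_pixel > 0);
    return data[y * bytes_per_row + x * samples_per_pixel];
}
```

No bounds check is done in release builds here, so the caller is responsible for keeping x and y inside the image.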
I didn't think it fair to post a comment on Fredrik Mörk's answer in this 2 year old post, so I thought I'd just ask it as a new question instead...
NB: This is not a criticism of the answer in any way, I'm simply trying to understand this all before delving into memory management / the Marshal class.
In that answer, the function GetByteArray allocates memory to each object within the given array, within a loop.
Would the GetByteArray function on the aforementioned post have benefited at all from allocating memory for the total size of the provided array:
Dim arrayBufferPtr = Marshal.AllocHGlobal(Marshal.SizeOf(<arrayElement>) * <array>.Count)
I just wonder if allocating the memory, as shown in the answer, causes any kind of fragmentation? Assuming there may be fragmentation, would there be much of an impact to be concerned with? Would allocating the memory in the way I've shown force you to call IntPtr.ToInt## to obtain pointer offsets from the overall allocation pointer, and therefore force you to check the underlying architecture to ensure the correct method is used*1 or is there a better way? (ToInt32/ToInt64 depending on x86/64?)
*1 I read elsewhere that calling the wrong IntPtr.ToInt## will cause overflow exceptions. What I mean by that statement is would I use:
Dim anOffsetPtr As New IntPtr(arrayBufferPtr.ToInt## + (loopIndex * <arrayElementSize>))
I've read through a few articles on the VB.Net Marshal class and memory allocation, listed below, but if you know of any other good articles I'm all ears!
http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.marshal.aspx
http://www.dotnetbips.com/articles/44bad06d-3662-41d3-b712-b45546cd8fa8.aspx
My favourite so far:
http://www.codeproject.com/KB/vb/Marshal.aspx
It is possible to allocate unmanaged memory for the whole array, and then copy each array element at an offset of SizeOf(arrayElement) * loopIndex. It is better to use the appropriate ToInt32/ToInt64 method for the current platform, like this:
Dim anOffsetPtr As IntPtr
If IntPtr.Size = 4 Then
anOffsetPtr = New IntPtr(arrayBufferPtr.ToInt32() + (loopIndex * arrayElementSize))
Else
anOffsetPtr = New IntPtr(arrayBufferPtr.ToInt64() + (loopIndex * arrayElementSize))
End If
On .NET 4 and later, IntPtr.Add(arrayBufferPtr, loopIndex * arrayElementSize) computes the same offset without the platform check.
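The single-allocation layout can be illustrated in C++, where the pointer-sized integer type removes the need for a 32/64-bit branch. This is only a sketch of the idea, not the Marshal API: `Element` and `copy_to_single_block` are hypothetical, and `std::malloc` stands in for AllocHGlobal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical array element; any trivially copyable struct would do.
struct Element {
    std::int32_t id;
    double weight;
};

// Copies `count` elements into ONE heap block instead of one block each.
// Returns nullptr on allocation failure; the caller releases it with std::free.
void* copy_to_single_block(const Element* elements, std::size_t count)
{
    assert(elements != nullptr && count > 0);
    void* block = std::malloc(sizeof(Element) * count);
    if (block == nullptr) return nullptr;

    // uintptr_t is pointer-sized on both 32-bit and 64-bit targets.
    auto base = reinterpret_cast<std::uintptr_t>(block);
    for (std::size_t i = 0; i < count; ++i) {
        void* slot = reinterpret_cast<void*>(base + i * sizeof(Element));
        std::memcpy(slot, &elements[i], sizeof(Element));
    }
    return block;
}
```

One allocation means one release call and no per-element heap bookkeeping, which is the main practical difference from allocating inside the loop.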
I've got a Lua program that seems to be slower than it ought to be. I suspect the issue is that I'm adding values to an associative array one at a time and the table has to allocate new memory each time.
There did seem to be a table.setn function, but it fails under Lua 5.1.3:
stdin:1: 'setn' is obsolete
stack traceback:
[C]: in function 'setn'
stdin:1: in main chunk
[C]: ?
I gather from the Google searching I've done that this function was deprecated in Lua 5.1, but I can't find what (if anything) replaced the functionality.
Do you know how to pre-size a table in Lua?
Alternatively, is there some other way to avoid memory allocation when you add an object to a table?
Let me focus more on your question:
adding values to an associative array
one at a time
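Tables in Lua are associative, but using them in array form (1..N) is optimized: internally, each table has both an array part and a hash part.
So, if you are indeed adding values associatively, follow the rules above.
If you are using indices 1..N, filling them in order lets Lua grow the array part by doubling. Be aware that setting a single far index such as t[100000] = something on an otherwise empty table does not pre-size the array part: Lua keeps an index in the array part only when more than half of the slots up to it are in use, so a lone far key lands in the hash part. The array part is also capped at a size specified within the Lua sources (2^26 = 67108864); beyond that, everything is associative.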
p.s. The old 'setn' method handled the array part only, so it's no use for associative usage (ignore those answers).
p.p.s. Have you studied the general tips for keeping Lua performance high? E.g. be aware of the cost of table creation and reuse a table rather than create a new one, and use 'local print = print' and the like to avoid global accesses.
static int new_sized_table( lua_State *L )
{
int asize = lua_tointeger( L, 1 );
int hsize = lua_tointeger( L, 2 );
lua_createtable( L, asize, hsize );
return( 1 );
}
...
lua_pushcfunction( L, new_sized_table );
lua_setglobal( L, "sized_table" );
Then, in Lua,
array = function(size) return sized_table(size,0) end
a = array(10)
As a quick hack to get this running you can add the C to lua.c.
I don't think you can - it's not an array, it's an associative array, like a perl hash or an awk array.
http://www.lua.org/manual/5.1/manual.html#2.5.5
I don't think you can preset its size meaningfully from the Lua side.
If you're allocating the table on the C side, though, the function
void lua_createtable (lua_State *L, int narr, int nrec);
may be what you need.
Creates a new empty table and pushes
it onto the stack. The new table has
space pre-allocated for narr array
elements and nrec non-array elements.
This pre-allocation is useful when you
know exactly how many elements the
table will have. Otherwise you can use
the function lua_newtable.
There is still an internal luaL_setn and you can compile Lua so that
it is exposed as table.setn. But it looks like it won't help
because the code doesn't seem to do any pre-extending.
(Also, as commented above, setn relates to the array part
of a Lua table, and you said that you are using the table as an associative
array.)
The good part is that even if you add the elements one by one, Lua does not
increase the array that way. Instead it uses a more reasonable strategy. You still
get multiple allocations for a larger array but the performance is better than
getting a new allocation each time.
Although this doesn't answer your main question, it answers your second question:
Alternatively, is there some other way to avoid memory allocation when you add an object to a table?
If you're running Lua in a custom application, as I guess you are since you're doing C coding, I suggest you replace the allocator with Loki's small-object allocator; it reduced my memory allocations 100+ fold. This improved performance by avoiding round trips to the kernel, and made me a much happier programmer :)
I tried other allocators, but they were more general and provide guarantees that don't benefit Lua applications (such as thread safety, large object allocation, etc.). Writing your own small-object allocator can also take a good week of programming and debugging to get just right, and after searching for an available solution, Loki's allocator was the easiest and fastest I found for this problem.
If you declare your table in code with a specific amount of items, like so:
local tab = { 0, 1, 2, 3, 4, 5, ... , n }
then Lua will create the table with memory already allocated for at least n items.
However, Lua uses the 2x incremental memory allocation technique, so adding an item to a table should rarely force a reallocation.
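To see why doubling makes reallocation rare, the following C++ sketch counts how often a buffer that doubles its capacity has to grow while items are appended one at a time. It is a simplified model of the growth strategy, not Lua's actual rehash logic, and `count_reallocations` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Counts capacity increases when appending `items` elements one by one
// to a buffer that starts empty and doubles whenever it is full.
std::size_t count_reallocations(std::size_t items)
{
    std::size_t capacity = 0;
    std::size_t reallocations = 0;
    for (std::size_t size = 0; size < items; ++size) {
        if (size == capacity) {
            // Guard against overflow before doubling.
            assert(capacity <= std::numeric_limits<std::size_t>::max() / 2);
            capacity = (capacity == 0) ? 1 : capacity * 2;
            ++reallocations;
        }
    }
    return reallocations;
}
```

Under this model, 1,000 appends trigger 11 reallocations and 1,000,000 appends trigger 21, so the number of allocations grows with the logarithm of the table size rather than with the size itself.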