What is the limit on automatic storage on OS X in Obj-C, and why do I need to use malloc instead of a normal auto array for large arrays? - objective-c

I came across a strange error today, and I still don't understand it:
long long N = 2000;
long long N2 = N*N;
long long *s = malloc(sizeof(long long)*N2); // create array
// populate it
for (long long k = 1; k <= 55; k++) {
doesn't produce any errors, but
long long N = 2000;
long long N2 = N*N;
long long s[4000000]; // create array
// populate it
for (long long k = 1; k <= 55; k++) {
gives me a code=2 EXC_BAD_ACCESS on the for line before assigning 1 to k (according to the debugger), as if there was no space left to allocate another 8-byte variable. This code is at the beginning of a method; no other variables have been assigned or allocated. I'm guessing that I simply can't allocate a 4000000-element long long array to the stack, but somehow I can allocate it to the dynamic heap. Could someone please explain what's going on, what the limits are, etc.? This is Objective-C on a Mac running Mountain Lion, 2GB RAM. A long long is 8 bytes wide, so the array should only be 32MB; I can't see why this should be an issue.
Thank you!
(By the way, if the details look familiar, it's because this is the beginning of my solver for Project Euler's Problem 149. I've avoided mentioning any details of the solution here, as I've solved the problem already.)

Your first example allocates memory from the heap, which is in the “data segment”, and your second allocates memory on the stack, which is in the “stack segment”. Each of these has a different size limit.
The default stack segment size limit, according to Technical Q&A QA1419, is 8 MiB. You can double-check this by running ulimit -a in a terminal:
:; ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 256
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 709
virtual memory (kbytes, -v) unlimited
As you can see, the stack size is limited to 8192 KiB = 8 MiB.
The Technical Q&A I linked above describes some ways to increase the stack size limit. The maximum to which you can increase it without running as root is 64 MiB.
If you create threads, each thread gets its own stack. According to the Q&A, you can set a thread's stack size up to 1 GiB if you use the NSThread API.
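The answer mentions the NSThread API; the plain-C equivalent of the same idea uses pthreads. A minimal sketch (the 512 MiB figure is arbitrary, and error checks are omitted for brevity):
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    long long s[4000000]; // ~32 MB of automatic storage: fine on this thread's large stack
    s[0] = 42;
    printf("%lld\n", s[0]);
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    pthread_t thread;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 512 * 1024 * 1024); // request a 512 MiB stack for the new thread
    pthread_create(&thread, &attr, worker, NULL);
    pthread_join(thread, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}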

Auto locals are allocated on the stack; according to this technical note, the default stack size for an OS X process's main thread is 8 MB, and less for additional threads. You can try the linker-option or setrlimit solutions given in the note, but the C tradition is to use the heap for any large allocation.
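Concretely, the heap-based version of the snippet above, with the error check that malloc makes possible (and that a stack array cannot offer), might look like this sketch:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long long N = 2000;
    long long N2 = N * N;

    // 4,000,000 * 8 bytes = ~32 MB: far over the 8 MiB stack limit,
    // but a routine request for the heap.
    long long *s = calloc((size_t)N2, sizeof *s); // zero-initialized
    if (s == NULL) {
        fprintf(stderr, "out of memory\n");
        return 1;
    }
    // ... populate and use s ...
    free(s);
    return 0;
}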

Related

The proper way to invalidate and flush Vulkan memory

When flushing and invalidating non-coherent memory in Vulkan, you must operate on ranges whose starting offset and size are both aligned to nonCoherentAtomSize, which on my physical device is 128 bytes. To do this you would round the starting offset DOWN and the size UP to that alignment (128 bytes). The issue I can see is that types have a less strict (smaller) alignment, and this rounding can spill the range outside the memory of the allocation. So:
// CREATE A BUFFER WITH SIZE 17
VkMemoryRequirements memRequirements;
vkGetBufferMemoryRequirements(logicalDevice, vk_buffer, &memRequirements);
memRequirements.size; // == 20
memRequirements.alignment; // == 4
// ON MY SETUP
Let's just say I allocate 20 bytes of memory and bind this buffer at the beginning (offset 0). If I want to flush this range, I would flush offset 0 with size 20, but this needs to be rounded up to 128 (nonCoherentAtomSize), which is bigger than the allocation. That isn't right, right? Likewise, is the memory returned from vkAllocateMemory guaranteed to be aligned to at least nonCoherentAtomSize? If not, the memory might begin at only, say, a 16-byte alignment, and if I round the offset down then I'm flushing a range that starts before the allocation, right?
Edit: Sorry, rounding down is actually impossible to get wrong here: the argument to flush and invalidate is an offset into the allocation, and anything rounded down to its alignment cannot be less than 0. But rounding up still looks like a problem to me.
There is never a reason to map a range of an allocation which is not aligned to nonCoherentAtomSize. If you do this, then you will find that you will be unable to properly flush or invalidate part of that range.
Indeed, there is no reason to ever map only part of an allocation you intend to map. Just map the whole thing, immediately after allocating it. At that point you can pass VK_WHOLE_SIZE as the size, which covers the case where the nonCoherentAtomSize-aligned size would exceed the allocation's range.
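Put together, the pattern described above might look like the following sketch, where device is a valid VkDevice and memory is a VkDeviceMemory from a host-visible, non-coherent memory type:
// Map the entire allocation once, right after vkAllocateMemory.
void *mapped = NULL;
vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);

// ... write data through `mapped` ...

// Flush the whole range. VK_WHOLE_SIZE sidesteps the problem of a
// nonCoherentAtomSize-rounded size spilling past the end of the allocation.
VkMappedMemoryRange range = {
    .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
    .memory = memory,
    .offset = 0,
    .size   = VK_WHOLE_SIZE,
};
vkFlushMappedMemoryRanges(device, 1, &range);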

Opencl Maximum Size of Private memory per Work Item

I have an AMD RX 570 4G.
OpenCL tells me that I can use a maximum of 256 workgroups, with 256 work items per group.
Let's say I use all 256 workgroups with 256 work items in each of them.
Now, what is the maximum size of private memory per work item?
Is private memory equal to total VRAM (4 GB) divided by total work items (256×256)?
Or is it equal to cache? If so, how?
VRAM is represented in OpenCL as global memory.
Private memory is initially allocated from the register file. Your RX 570 is from AMD's Polaris architecture, a.k.a. GCN 4, where each compute unit (64 shader processors) has access to 256 vector (SIMD) registers (64 lanes × 32 bits wide) and 512 32-bit scalar registers. That works out to about 66 KiB per CU, but it's not as simple as just quoting that total.
A workgroup will always be scheduled on a single compute unit, so if you assign it 256 work items, then it will have to perform every vector instruction 4 times in sequence (64 x 4 = 256) and the vector registers will (simplifying slightly) effectively have to be treated as 64 256-entry registers.
Scalar registers are used for data and calculations which are identical on each work item, e.g. incrementing a loop counter, holding buffer base pointers, etc.
Private memory will usually spill to global memory if you use more than fits in the register file, at which point performance simply drops.
So essentially, on GCN your optimal workgroup size is usually 64. Use as little private memory as possible; definitely aim for less than half of the available register file, so that more than one workgroup can be scheduled per CU and memory-access latency can be papered over. Otherwise your shader cores will spend a lot of time just waiting for data to arrive or be written out.
Cache is used for OpenCL local and constant memory spaces. (Constant will again spill to global if you try to use too much. The size of local memory can be checked via the OpenCL API and again is divided among workgroups scheduled on the same compute unit, so if you use more than half, only one group can run on a CU, etc.)
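For reference, the queries for those limits look like this (a sketch, assuming device is a valid cl_device_id):
#include <stdio.h>
#include <CL/cl.h>

void print_memory_limits(cl_device_id device) {
    cl_ulong local_mem = 0, global_mem = 0;
    size_t max_wg = 0;

    // Local memory per compute unit, shared by all workgroups scheduled on it.
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof local_mem, &local_mem, NULL);
    // Total global memory (the VRAM on a discrete GPU).
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof global_mem, &global_mem, NULL);
    // Largest workgroup the device will accept.
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof max_wg, &max_wg, NULL);

    printf("local mem:  %llu KiB\n", (unsigned long long)(local_mem / 1024));
    printf("global mem: %llu MiB\n", (unsigned long long)(global_mem / (1024 * 1024)));
    printf("max workgroup size: %zu\n", max_wg);
}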
I don't know where you're getting a limit of 256 workgroups from; the limit is essentially set by whether the GPU uses 32-bit or 64-bit addressing. Most applications won't get close to 4 billion work items even in the 32-bit case.
Private memory space is registers on the GPU die (0 cycle access latency) and not related to the amount of VRAM (global memory space) at all. The amount of private memory depends on the device (private memory per compute unit).
I don't know private memory size for the RX 570, but for older HD7000 series GPUs it is 256kB per CU. If you have a work group size of 256, you get 1kB per work item, which is equal to 256 float variables.
Cache size determines the size of local and constant memory space.

Loading large set of images kill the process

Loading 1500 images of size (1000, 1000, 3) breaks the code: the process is killed with signal 9 (kill 9) and no further error. Memory used before this line of code is 16% of system total. The total size of the images directory is 7.1 GB.
X = np.asarray(images).astype('float64')
y = np.asarray(labels).astype('float64')
system spec is:
OS: macOS Catalina
processor: 2.2 GHz 6-Core Intel Core i7
memory: 16 GB 2400 MHz DDR4
Update:
getting the below error while running the code on 32 vCPUs and 120 GB of memory:
MemoryError: Unable to allocate 14.1 GiB for an array with shape (1200, 1024, 1024, 3) and data type float32
You would have to provide some more info/details for an exact answer, but assuming this is a memory error (incredibly likely), the size of the images on disk is irrelevant: compressed files expand enormously once decoded into arrays. At float64, one (1000, 1000, 3) image takes 1000 × 1000 × 3 × 8 bytes = 24 MB, so 1500 of them come to roughly 36 GB; your updated error confirms the arithmetic, since (1200, 1024, 1024, 3) at float32 is about 14.1 GiB. Intuitively I would say that 16 GB of RAM is nowhere near enough to load 7 GB of compressed images. It's impossible to say exactly how much you would need, but from experience I'd bump it up to 64 GB. If you are using Keras, I would suggest looking into the DirectoryIterator.
Edit:
As Cris Luengo pointed out, I missed the fact that you stated the size of the images.

Why are large numpy arrays 64-byte aligned but not smaller ones

The following code:
import numpy as np

prev = []
addresses = []
for i in range(10000):
    a = np.ones(x).astype(np.float32)  # x is the array length under test
    prev.append(a)                     # keep a reference so each address stays live
    address = a.__array_interface__['data'][0]
    assert(address % 64 == 0)
    assert(address not in addresses)
    addresses.append(address)
will not raise an AssertionError for values of x > 252, suggesting that arrays of more than 252 float32 elements (or more than 505 when using float16) are aligned differently from smaller arrays. What is the reason for this?
I am on OSX (Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz) running numpy 1.12.1.
Your test loop isn't accomplishing exactly what you expect unless the arrays are kept alive: if only one array exists in memory at a time, it's quite possible - indeed LIKELY - that new ones will be allocated at the same memory address as the one just freed. You have to do something like append the arrays to a list (thus making them all exist in memory simultaneously, as the prev list in your code does) to actually test 10000 distinct allocations.
However, I can easily believe that you're seeing a real effect, as it's perfectly reasonable for a memory allocator to use different strategies based on the size of the block being allocated. For example, at some point the allocator may stop trying to use memory it already has and start requesting entire memory pages directly from the operating system. Once that threshold is reached, you'd find that everything is aligned on a much higher power-of-2 boundary than 64, perhaps 4096. You seem to be hitting some intermediate threshold at 1024 bytes (including overhead); it might be interesting to test for 128/256/512/1024-byte alignment.
Here is my guess: using aligned memory typically involves allocating a larger block and then releasing the upfront bytes that fall before the alignment boundary.
This is insignificant for large arrays, but for small arrays the fragmentation and overhead introduced likely outweigh the benefits.
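For what it's worth, that over-allocate-then-skip strategy is easy to sketch in C; this is a toy illustration of the idea, not numpy's actual allocator:
#include <stdint.h>
#include <stdlib.h>

// Return `size` bytes aligned to `align` (a power of two, at least
// sizeof(void *)). Over-allocates, skips ahead to the boundary, and
// stashes malloc's pointer just below the block so it can be freed later.
static void *aligned_alloc_demo(size_t size, size_t align) {
    void *raw = malloc(size + align + sizeof(void *));
    if (raw == NULL) return NULL;
    uintptr_t base = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (base + align - 1) & ~(uintptr_t)(align - 1);
    ((void **)aligned)[-1] = raw; // remember where malloc's block starts
    return (void *)aligned;
}

static void aligned_free_demo(void *p) {
    if (p != NULL) free(((void **)p)[-1]);
}
The up to align + sizeof(void *) bytes wasted per call is exactly the overhead that makes this unattractive for small arrays.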

Finding the correct alignment for host visible memory

ppData points to a pointer in which is returned a host-accessible pointer to the beginning of the mapped range. This pointer minus offset must be aligned to at least VkPhysicalDeviceLimits::minMemoryMapAlignment.
I want to allocate a Vec3 of floats in a uniform buffer. A Vec3 of floats is 12 bytes.
VkMemoryRequirements { size: 16, alignment: 16, memory_type_bits: 15 }
Vulkan reports that it has to be aligned to 16 bytes, which means that the size of the allocation is now 16 instead of 12. So Vulkan already handled this for me.
minMemoryMapAlignment on my GPU is 64 bytes. What exactly does this mean for my allocation? Does it mean that I cannot use the size from VkMemoryRequirements for my allocation, and that instead of allocating 16 bytes here I would have to allocate 64?
Update:
For a 12-byte allocation with a 16-byte alignment and a 64-byte minMemoryMapAlignment, I would still allocate only 16 bytes and then call:
vkMapMemory(device, memory, 0, 16, 0, &mapped);
But the region behind the pointer returned from vkMapMemory is then actually not 16 but 64 bytes wide? And all the relevant data is in the first 12 bytes, the rest being just "padded" memory? So in practice this basically means that I don't need to use minMemoryMapAlignment at all?
There is nothing in the spec that restricts the size of the allocation like that. The paragraph you quoted means that the mapping will be aligned to minMemoryMapAlignment, and you can then tell the compiler to use aligned memory accesses when accessing it. What will happen is that, when the memory is mapped, the latter 48 bytes of the 64-byte-aligned region are wasted space in the host's address space. That is unlikely to matter, though.
This is why people keep saying to allocate larger blocks and subdivide them as needed. That way you can put 4 of those VkBuffers into a single 64-byte allocation (which you will need if you want to pipeline the rendering).
It's highly unlikely that that single vec3 is the only thing you need memory for, so take a look at your other allocations and see which ones you can combine.
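A sketch of what that combining might look like, assuming buffers holds four identical VkBuffers, memRequirements came from vkGetBufferMemoryRequirements on one of them, and memory is a single VkDeviceMemory allocation large enough for all four:
// Round the per-buffer size up to the required alignment (16 bytes in the
// question), so four slots pack into one 64-byte mapped region.
VkDeviceSize stride = (memRequirements.size + memRequirements.alignment - 1)
                      & ~(memRequirements.alignment - 1);
for (uint32_t i = 0; i < 4; i++) {
    vkBindBufferMemory(device, buffers[i], memory, i * stride);
}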