Can I (and if so, how do I) define two matrices in virtual memory so I can use the RAM to perform matrix multiplication?
Is video RAM separate from main memory, or can I use it for matrix multiplication? If so, would the advantage be speed?
thanks.
All allocated memory will be in "virtual memory". If you malloc() a hunk of memory, that is "virtual" memory.
But it sounds like you need something faster? Do you have performance analysis that indicates a problem?
In any case, you'll likely want to look into OpenCL, if you really need that extra speed.
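As a baseline before reaching for OpenCL, here is a minimal sketch of the plain-RAM approach: two matrices allocated with the ordinary allocator (so they live in virtual memory, as described above) and multiplied on the CPU. The names matrix_alloc and matrix_multiply are made up for this example, and the size computation does not guard against overflow.

```c
#include <stdlib.h>

/* Allocate a rows x cols matrix of doubles in row-major order, zero-filled.
   Returns NULL on failure; release it with free().
   Note: rows * cols is not checked for overflow in this sketch. */
double *matrix_alloc(size_t rows, size_t cols)
{
    return calloc(rows * cols, sizeof(double));
}

/* c = a * b, where a is n x m, b is m x p and c is n x p (all row-major). */
void matrix_multiply(const double *a, const double *b, double *c,
                     size_t n, size_t m, size_t p)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < p; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < m; k++) {
                sum += a[i * m + k] * b[k * p + j];
            }
            c[i * p + j] = sum;
        }
    }
}
```

Profiling this version first should show whether the extra complexity of a GPU path is worth it for your matrix sizes.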
I have read that malloc has multiple implementations which are platform dependent.
How does it work in an embedded device in bare metal programming?
Let's suppose we have an mcu with 256KB FLASH memory and 64KB RAM.
How does it know how much available RAM there is from my program?
For bare metal systems, you'll have a specific segment allocated in the linker script, often called .heap. There is no such thing as memory sharing between processes, meaning that the heap must have a fixed maximum size and therefore is pretty useless in general. malloc doesn't know a thing about how much RAM your program uses since there is no desktop OS in sight.
Your RAM is divided into .stack, .data, .bss and .heap, each with its own fixed maximum size. More about these segments here: https://electronics.stackexchange.com/a/237759/6102. In a typical bare metal MCU application, most of the RAM will be reserved for .data and .bss. You will have something from 128 bytes up to several kb reserved for the stack. You will typically not have a heap at all - but if you do, it will sit there and take up a fixed amount of x kb no matter how much of it you actually use.
malloc in itself could be implemented in different ways indeed. Either you include a "header" together with each allocated segment, the header stating the allocated size and potentially the address of the next available free segment. Or you could implement it as a look-up table where each item is a pointer to the first element and the size.
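To make the "header" variant concrete, here is a toy first-fit allocator over a fixed array that stands in for the .heap segment. It is only a sketch under stated assumptions: HEAP_SIZE, toy_malloc and toy_free are names invented for this example, the code needs C11 (max_align_t, _Alignof), it only merges a freed block with the blocks after it, and it is neither thread-safe nor interrupt-safe. It is not how any particular C library implements malloc.

```c
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 1024u   /* hypothetical; must be a multiple of ALIGNMENT */

/* Header stored in front of every block, free or allocated. */
typedef struct block_header {
    size_t size;                 /* payload bytes following the header */
    int    is_free;              /* 1 if the payload is available */
    struct block_header *next;   /* next block in address order */
} block_header;

#define ALIGNMENT   _Alignof(max_align_t)
#define ROUND_UP(n) (((n) + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT)
#define HEADER_SIZE ROUND_UP(sizeof(block_header))

/* The union forces the byte array to be suitably aligned. */
static union {
    max_align_t align;
    unsigned char bytes[HEAP_SIZE];
} heap;

static block_header *first_block = NULL;

void *toy_malloc(size_t size)
{
    if (size == 0 || size > HEAP_SIZE) return NULL;
    size = ROUND_UP(size);

    if (first_block == NULL) {   /* lazy init: one big free block */
        first_block = (block_header *)heap.bytes;
        first_block->size = HEAP_SIZE - HEADER_SIZE;
        first_block->is_free = 1;
        first_block->next = NULL;
    }

    for (block_header *b = first_block; b != NULL; b = b->next) {
        if (!b->is_free || b->size < size) continue;
        /* Split when the remainder can hold a header plus one aligned unit. */
        if (b->size >= size + HEADER_SIZE + ALIGNMENT) {
            block_header *rest =
                (block_header *)((unsigned char *)b + HEADER_SIZE + size);
            rest->size = b->size - size - HEADER_SIZE;
            rest->is_free = 1;
            rest->next = b->next;
            b->size = size;
            b->next = rest;
        }
        b->is_free = 0;
        return (unsigned char *)b + HEADER_SIZE;
    }
    return NULL;   /* no free block is large enough */
}

void toy_free(void *ptr)
{
    if (ptr == NULL) return;
    block_header *b = (block_header *)((unsigned char *)ptr - HEADER_SIZE);
    b->is_free = 1;
    /* Merge with any free blocks that directly follow this one. */
    while (b->next != NULL && b->next->is_free) {
        b->size += HEADER_SIZE + b->next->size;
        b->next = b->next->next;
    }
}
```

Note how every allocation costs HEADER_SIZE extra bytes, which is part of the overhead discussed in this answer.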
None of this is particularly relevant, since you shouldn't be using heap allocation in embedded systems. The main reason being that it doesn't make any sense. You don't want arbitrary behavior, you want deterministic behavior. You want to allocate x amount of memory for the worst case, and if a heap were to be used it would have to be at least that large anyway, so you gain nothing but bloat from using a heap. Then come all the usual problems with allocation overhead, fragmentation and leaks.
For bare metal/RTOS applications, do yourself a favour and delete .heap from your linker script, then forget that you ever heard about malloc. An MCU is not a PC.
Are there any technical reasons to limit the maximum number of memory allocations?
Check out vkAllocateMemory manual page. It says:
The maximum number of valid memory allocations that can exist simultaneously within a VkDevice may be restricted by implementation- or platform-dependent limits. If a call to vkAllocateMemory would cause the total number of allocations to exceed these limits, such a call will fail and must return VK_ERROR_TOO_MANY_OBJECTS.
OpenGL doesn't limit allocations, DirectX 11/12 neither. So why should Vulkan do so?
As explained here, this is primarily an OS limitation.
OpenGL doesn't limit allocations, DirectX 11/12 neither
Oh, they do. They just don't tell you about it.
OpenGL and DX11 drivers tend to internally do large GPU (virtual) allocations and perform sub-allocations from those when you allocate memory. Thus, they can create the illusion that you can perform more hardware allocations. But the limitation is still there.
As for DX12, I'm fairly sure that if you try to allocate more than 4096 heaps, you will find CreateHeap returning errors.
Vulkan is simply the API that's up-front about the existence of the limitation.
With Vulkan, this is simply a problem that should never arise. If you're performing over a thousand individual memory allocations, your memory allocation scheme is wrong. You're supposed to allocate a few large slabs of memory, then use sub-sections of them for your textures and buffers.
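The "few large slabs, then sub-sections" scheme can be sketched as a simple bump sub-allocator that only hands out offsets. This is an illustration, not Vulkan code: the slab type and slab_suballocate are names invented here, there is no way to free individual sub-allocations, and the slab itself is assumed to come from a single vkAllocateMemory call that is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bookkeeping for one large device allocation. Only offsets are tracked. */
typedef struct {
    uint64_t capacity;  /* total bytes in the slab */
    uint64_t used;      /* bytes handed out so far, including padding */
} slab;

/* Reserve `size` bytes aligned to `alignment` (must be a nonzero power of two).
   On success, stores the offset in *offset_out and returns true. */
bool slab_suballocate(slab *s, uint64_t size, uint64_t alignment,
                      uint64_t *offset_out)
{
    uint64_t aligned = (s->used + alignment - 1) & ~(alignment - 1);
    if (aligned < s->used || aligned > s->capacity ||
        size > s->capacity - aligned) {
        return false;   /* arithmetic overflow or slab exhausted */
    }
    *offset_out = aligned;
    s->used = aligned + size;
    return true;
}
```

In a real application the returned offset would be passed as the memoryOffset argument when binding a buffer or image to the slab, with size and alignment taken from the resource's memory requirements.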
I'm creating a list of elements inside a task in the following way:
l = (dllist*)pvPortMalloc(sizeof(dllist));
dllist is 32 bytes in size.
My embedded system has 60kB of SRAM, so I expected that the system could easily handle my 200-element list. I found out that after allocating space for 8 elements, the system crashes on the 9th malloc call (256+ bytes).
If possible, where can I change the heap size inside freeRTOS?
Can I somehow request the current status of heap size?
I couldn't find this information in the documentation so I hope somebody can provide some insight in this matter.
Thanks in advance!
(Yes - FreeRTOS pvPortMalloc() returns void*.)
If you have 60K of SRAM, and configTOTAL_HEAP_SIZE is large, then it is unlikely you are going to run out of heap after allocating 256 bytes unless you had hardly any heap remaining beforehand. Many FreeRTOS demos will just keep creating objects until all the heap is used, so if your application is based on one of those, then you would be low on heap before your code executed. You may have also done something like use up loads of heap space by creating tasks with huge stacks.
heap_4 and heap_5 will combine adjacent blocks, which will minimise fragmentation as far as practical, but I don't think that will be your problem - especially as you don't mention freeing anything anywhere.
Unless you are using heap_3.c (which just makes the standard C library malloc and free thread safe) you can call xPortGetFreeHeapSize() to see how much free heap you have. You may also have xPortGetMinimumEverFreeHeapSize() available to query how close you have ever come to running out of heap. More information: http://www.freertos.org/a00111.html
You could also define a malloc() failed hook (http://www.freertos.org/a00016.html) to get instant notification of pvPortMalloc() returning NULL.
For the standard allocators you will find a config option in FreeRTOSConfig.h.
However:
It is quite possible that you have already run out of memory, depending on the allocator used. IIRC there is one that does not free() any blocks (free() is just a dummy), so any block returned will be lost. This is still useful if you only allocate memory e.g. at startup, but then work with what you've got.
Other allocators might just not merge adjacent blocks once returned, increasing fragmentation much faster than a full-grown allocator.
Also, you might lose memory to fragmentation. Depending on your alloc/free pattern, you might quickly end up with a heap looking like Swiss cheese: many holes between allocated blocks. So while there is still enough free memory in total, no single block is big enough for the size required.
If you only allocate blocks of that size there, you might be better off using your own allocator or a pool (blocks of fixed size). That would be statically allocated (e.g. an array) and chained as a linked list during startup. Alloc/free would then just be push/pop on a stack (or put/get on a queue). That would also be very fast, with O(1) complexity (and interrupt-safe if properly written).
Note that normal malloc()/free() are not interrupt-safe.
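A minimal sketch of such a fixed-size pool follows. The element type is a hypothetical 32-byte stand-in for the question's dllist, POOL_BLOCKS is taken from the 200-element figure in the question, and the function names are invented for this example. As written it is not interrupt-safe: the push and pop would have to be wrapped in a critical section if called from more than one context.

```c
#include <stddef.h>

#define POOL_BLOCKS 200   /* from the question's 200-element list */

/* Hypothetical 32-byte payload standing in for the question's dllist. */
typedef struct { unsigned char payload[32]; } element;

/* While a block is free, its storage holds the free-list link. */
typedef union pool_block {
    union pool_block *next_free;
    element data;
} pool_block;

static pool_block pool[POOL_BLOCKS];   /* statically allocated, no heap */
static pool_block *free_list = NULL;

/* Chain every block into the free list; call once at startup. */
void pool_init(void)
{
    free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        pool[i].next_free = free_list;
        free_list = &pool[i];
    }
}

/* Pop a block in O(1). Returns NULL when the pool is exhausted. */
element *pool_alloc(void)
{
    pool_block *b = free_list;
    if (b == NULL) return NULL;
    free_list = b->next_free;
    return &b->data;
}

/* Push a block back in O(1). `e` must have come from pool_alloc(). */
void pool_free(element *e)
{
    pool_block *b = (pool_block *)e;
    b->next_free = free_list;
    free_list = b;
}
```

Because every block has the same size, this pool cannot fragment, and its memory footprint is known at link time.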
Finally: Do not cast void *. (Well, that's actually what standard malloc() returns and I expect that FreeRTOS-variant does the same).
Is it possible to increase performance by running on a GPU for the algorithm with the following properties:
There are hundreds and even thousands of independent threads, which do not require any synchronization during calculations
Each thread has a relatively small (less than 200Kb) local memory region containing thread-specific data. Read/Write
Each thread accesses a large memory block (hundreds of megabytes and even gigabytes). This memory is read-only
For each access to the global memory there will be at least two accesses to the local memory
There will be a lot of branches in the algorithm
Unfortunately the algorithm is too complicated to be shown here.
My instinct is to use texture memory aggressively. The caching benefits will beat uncoalesced global memory reads by a mile.
For the writes, you may need to add some padding etc. to avoid bank conflicts.
The reliance on hundreds of meg or gigs of data is somewhat concerning. Can you carve it up somehow? Hope you have a big beefy Tesla/Quadro w/ oodles of RAM.
That said, the name of the game for CUDA optimization is always to experiment, profile/measure, rinse and repeat.
Before I start, please remember that there are two layers of parallelism in CUDA: blocks and threads.
There are hundreds and even thousands of independent threads, which do
not require any synchronization during calculations
Since you can launch as many as 65535 blocks per dimension, you can treat each block in CUDA as equivalent to one of your "threads".
Each thread has a relatively small (less than 200Kb) local memory
region containing thread-specific data. Read/Write
Unfortunately most cards have a shared memory limit of 16k per block. So if you can figure out how to work within this lower limit, great. If not, you will need to use global memory accesses.
Each thread accesses a large memory block (hundreds of megabytes and
even gigabytes). This memory is read-only
You cannot bind such large arrays to textures or constant memory. So in a given block, try to make the threads read contiguous chunks of data for the best performance.
For each access to the global memory there will be at least two accesses to the local memory
There will be a lot of branches in the algorithm
Since you are essentially replacing a single thread in your original implementation with a block in CUDA, you may want to revise the code a little bit to try and implement a parallel version of the "per thread code" too.
This may not be clear at first glance, but think it through a little. Any algorithm that has hundreds or thousands of independent parts with no synchronization needed is great for a parallel implementation, even with CUDA.
The CUDA programming guide states that
"Bandwidth is one of the most important gating factors for performance. Almost all changes to code should be made in the context of how they affect bandwidth."
It goes on to calculate theoretical bandwidth which is in the order of hundreds of gigabytes per second. I am at a loss as to why how many bytes one can read/write to global memory is a reflection of how well optimised a kernel is.
If I have a kernel which does intensive computation on data stored in shared memory and/or registers, with only a single read at the start and write out at the end from and to global memory, surely the effective bandwidth will be small, while the kernel itself may be very efficient.
Could any one further explain bandwidth in this context?
Thanks
Almost all nontrivial computational kernels, in CPU and GPU land, are memory bound.
A GPU has very high computational intensity and throughput, but access to main memory is very slow and has high latency: a few hundred cycles per read/store versus four cycles for many arithmetic operations.
It sounds like your kernel is computation bound, so you're in luck. However, you still have to watch out for shared memory bank conflicts, which can serialize portions of code unexpectedly.
Most kernels are memory bound so maximising memory throughput is critical. If you're lucky enough to have a compute bound kernel then optimizing for computation is generally easier. You do need to look out for divergence and you should still ensure you have enough threads to hide memory latency.
Check out the Advanced CUDA C presentation for more information, including some tips for how to compare your realised performance with theoretical performance. The CUDA Best Practices Guide also has some good information; it's available as part of the CUDA toolkit (download from the NVIDIA site).
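The comparison between realised and theoretical performance comes down to two small formulas, sketched below in plain C. The function names are invented for this example, the factor of 2 assumes double-data-rate memory, and the numbers in the usage note are made up rather than taken from any real card.

```c
/* Theoretical peak bandwidth in GB/s:
   memory clock (Hz) * bus width (bytes) * 2 (double data rate) / 1e9. */
double theoretical_bandwidth_gbs(double memory_clock_hz,
                                 unsigned bus_width_bits)
{
    return memory_clock_hz * (bus_width_bits / 8.0) * 2.0 / 1e9;
}

/* Effective bandwidth in GB/s: bytes the kernel actually moved to and
   from global memory, divided by its measured run time. */
double effective_bandwidth_gbs(double bytes_read, double bytes_written,
                               double seconds)
{
    return (bytes_read + bytes_written) / 1e9 / seconds;
}
```

For example, a hypothetical 1 GHz memory clock on a 256-bit bus gives a peak of 64 GB/s, and a kernel that reads 1 GB and writes 1 GB in 0.1 s achieves about 20 GB/s. A kernel that reads once, computes for a long time in registers and shared memory, and writes once will show a low effective figure without being inefficient; in that case the number simply indicates that the kernel is compute bound rather than memory bound.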
Typically kernels are fairly small and simple and perform the same operation on a lot of data. You might have a bunch of kernels that you invoke in sequence to perform some more complex operation (think of it as a processing pipeline). Obviously the throughput of your pipeline will depend both on how efficient your kernels are and whether you are limited by memory bandwidth in any way.