boost::serialization high memory consumption during serialization

As the title suggests, I've run into an issue with boost::serialization when serializing a huge amount of data to a file. The serialization part of the application takes around 3 to 3.5 times the memory of the objects being serialized.
Note that my data structure is a three-dimensional vector of base class pointers, held through a pointer to that structure, like this:
using namespace std;
vector<vector<vector<MyBase*> > >* data;
This is later serialized with code analogous to this:
ar & BOOST_SERIALIZATION_NVP(data);
boost/serialization/vector.hpp is included.
Classes being serialised all inherit from "MyBase".
Since the start of my project I've used different archives for serialization: the typical binary_archive, text, xml and finally polymorphic binary/xml/text. Every single one of them behaves exactly the same way.
Typically this wouldn't be a problem if I had to serialize small amounts of data, but the number of objects I have is in the millions (ideally around 10 million), and my tests consistently show that the memory allocated by the boost::serialization part of the code is around 2/3 of the application's whole memory footprint while writing the file.
This amounts to around 13.5 GB of RAM taken for 4 million objects, where the objects themselves take 4.2 GB. This is as far as I've been able to take my code, since I don't have access to a machine with more than 8 GB of physical RAM. I should also note that this is a 64-bit application running on Windows 7 Professional x64, but the situation is similar on an Ubuntu box.
Does anyone have any idea how I would go about troubleshooting this? It is unacceptable for me to have such high memory requirements for an application that will not use as much memory while running as it does while serializing.
Deserialization isn't as bad, as it allocates around 1.5 times the needed memory. This is something I could live with.
I tried turning tracking off with boost::archive::archive_flags::no_tracking, but it behaves exactly the same.
Does anyone have any idea what I should do?

Using valgrind I found that the main cause of the memory consumption is a map inside the library that tracks pointers. If you are certain that you do not need pointer tracking (that is, you are sure there is no pointer aliasing), disable tracking. You can find here the main concepts of disabling tracking. In short, you must do something like this:
BOOST_CLASS_TRACKING(vector<vector<vector<MyBase*> > >, boost::serialization::track_never)
In my question I wrote a version of this macro that lets you disable tracking of a template class. This should have a significant impact on your memory consumption.
Also notice that there are pointers inside the containers. If you never want them tracked, you must disable tracking for them too. Currently I could not find any way to do this properly.
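To get a feel for why a pointer-tracking table can dominate memory, here is a toy model. It is not Boost's actual implementation: the hypothetical ToyTracker simply records every object address it has seen in a std::map, which is roughly what any archive has to do if it wants to write a shared object only once. All names here except MyBase are assumptions made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy model only: NOT Boost's implementation. It mimics the idea of an
// archive remembering every address it has already written.
struct ToyTracker {
    std::map<const void*, std::uint32_t> seen;  // address -> object id

    // Returns true the first time an address is seen (the object must be
    // written), false if it was seen before (only its id would be stored).
    bool first_visit(const void* p) {
        auto result = seen.emplace(p, static_cast<std::uint32_t>(seen.size()));
        return result.second;
    }
};

struct MyBase {
    virtual ~MyBase() = default;
    int payload = 0;
};

// Walks a 3-D vector of pointers shaped like the one in the question and
// returns how many tracking entries were created along the way.
inline std::size_t count_tracked(
        const std::vector<std::vector<std::vector<MyBase*>>>& data) {
    ToyTracker tracker;
    for (const auto& plane : data)
        for (const auto& row : plane)
            for (const MyBase* p : row)
                if (p != nullptr) tracker.first_visit(p);
    return tracker.seen.size();
}
```

In a node-based map every entry carries the key, the value, a few node pointers and allocator bookkeeping, so with millions of distinct objects the table alone can add up to a noticeable share of the footprint. How much of the real overhead this explains in Boost would have to be measured.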

Related

Weird memory corruption issue, FreeRTOS, STM32F777II

I am currently working on an embedded firmware development which uses FreeRTOS running on an STM32F777II microcontroller. Resource wise, I have around 10 tasks (total sum of stack size will be under 40 KByte) at the same priority, around 4 queues of 1KByte each, 4 binary semaphores. I know this would be an incomplete question without posting the actual code, but I really do not have any specific portion in my firmware that I think will be worth sharing related to my issue. I have a ton of business logic in my code which I cannot fully share as well.
I have a struct which consists of multiple char and int arrays of specific lengths. 4 of the tasks use one of these structs each. Each struct consumes around 15 KByte and is defined in the global space of the FreeRTOS environment, not local to a task. The structs are allocated statically only, not dynamically at run time. And since I initialize a few members of the struct when declaring it, they go to the .data section, if I am not mistaken.
Until now, there had been absolutely no problem in my code and it worked without any issue at all. I recently had a requirement to add the same struct to 2 more tasks. So I added this 15 KByte struct to one of my tasks, just allocating and initializing it without doing any processing in any of the tasks. I observed no problems and no data corruption. But when I allocated one more struct variable of the same type, I observed data corruption in a lot of other places in my project. Some of the queues stopped working correctly and showed garbage data when read. Some of the other buffers also showed data corruption. If I remove this one allocation, everything goes back to normal.
I am really not sure why just one more variable of this struct is triggering so much data corruption elsewhere in my project. My MCU has 512 KB of RAM and the IDE's build analyzer showed below 40% RAM usage, so what is triggering this issue? Any suggestions to try? Could it be caused by some overlap of the .data or .bss sections? I did not observe any stack overflows or hard faults in the system during this.
For a quick resolution, I simply disabled the D-cache by commenting out this call:
SCB_EnableDCache();
and voila, everything started to function correctly as it should without any instances of data corruption.
For a long-term, correct resolution:
It looks like there are some latent issues in my code. I need to review the memory use and the regions of memory with different properties, look at the buses, and review any DMA usage and the MPU memory settings. I also need to review the correct usage of volatile, thread-safe operation, and cache-coherency issues, and use memory fences and cache flushing as appropriate.
More details: Level 1 cache on STM32F7 Series and STM32H7 Series
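One plausible mechanism (an assumption, the post does not confirm it) is that adding a global shifted other buffers so that a DMA buffer now shares a 32-byte cache line with unrelated data. Cache clean/invalidate operations on the Cortex-M7 work on whole 32-byte lines, so a buffer that is not line-aligned drags its neighbours along. The sketch below is host-side C++ that only does the address arithmetic; the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Cortex-M7 data cache line size in bytes.
constexpr std::uintptr_t kCacheLine = 32;

struct CacheRange {
    std::uintptr_t start;  // rounded down to a cache-line boundary
    std::size_t    size;   // rounded up to a whole number of lines
};

// Expands [addr, addr + len) to whole cache lines, which is the granularity
// at which clean/invalidate operations actually work.
inline CacheRange cache_aligned_range(std::uintptr_t addr, std::size_t len) {
    const std::uintptr_t mask = ~(kCacheLine - 1);
    const std::uintptr_t start = addr & mask;
    const std::uintptr_t end = (addr + len + kCacheLine - 1) & mask;
    return CacheRange{start, static_cast<std::size_t>(end - start)};
}

// True if the buffer occupies whole cache lines, so maintenance operations
// on it cannot touch neighbouring variables.
inline bool is_cache_safe(std::uintptr_t addr, std::size_t len) {
    return (addr % kCacheLine == 0) && (len % kCacheLine == 0);
}

// On the target, the aligned range would typically be handed to the CMSIS
// cache maintenance calls (SCB_CleanDCache_by_Addr before a DMA transfer
// reads the memory, SCB_InvalidateDCache_by_Addr after a DMA transfer has
// written it). Check your CMSIS version for the exact signatures.
```

If is_cache_safe() is false for a DMA buffer, aligning and padding the buffer to 32 bytes, or placing it in a non-cacheable MPU region, are the usual things to try before re-enabling the D-cache.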

Physical Memory and Virtual Memory data allocation behavior

I'm interested in understanding how a computer allocates variables in physical memory versus files in virtual memory (such as on a hard drive), in terms of how the computer decides where to put data. It almost seems random in both storage types, but it can't be, because data simply can't be put at a memory address or hard drive sector that is already occupied or allocated to another process. When I was studying Norton's Speed Disk (a program that defragments files on hard drives) on my old W95 system, I noticed from the program's representation of the hard drive's data (a color-coded visual map of different data types, e.g. swap files were always first at the top) that many files were spread out all over the hard drive with empty, unused areas between them. In some of these areas, a mix of data and empty space showed a spotty pattern. I want to think it's random for that to happen. Likewise, when I was studying the memory addresses of a simple program I wrote in C, I noticed that each version of my program, recompiled after changes, showed different addresses for segments and offsets. I was expecting the computer to use the same address when I recompiled it. Sometimes the same address would be used, other times it was different. Again, I want to think memory locations are also chosen at random. I had thought that memory allocation or file writing was based on the first empty space available, written in a contiguous manner.
So my question is: what is it in the logic of a common computer that decides where it writes its data in such an arbitrary manner for either type of location (physical RAM or dynamic)? What area of computer science (if not assembly language) would I need to study to explain this almost random behavior?
Thanks in Advance
Something broader and directly from computer science would be a linked list. http://en.wikipedia.org/wiki/Linked_list
Imagine if you had a linked list and simply added items to the end; these items might live linearly in memory or on disk somewhere. But then you remove some items in the middle of the list, say by having item number 7 point at item number 9, eliminating item number 8. As with memory allocation for allocs, virtual memory, hard drive sector allocation, etc., how fast you fragment your storage depends on the algorithm you use for allocating the next item.
File systems can and do use a linked-list type scheme to keep track of which sectors are tied to a single file. It is fast and easy to use the linked list, but you have to deal with fragmentation. A much slower method would be to have no fragmentation but to be constantly copying or moving files around to keep them on linear sectors.
malloc() allocation schemes and MMU allocation schemes also fall under this category: basically any time you take something, slice it up into fractions and put a virtual interface in front of those fractions to give the programmer or user the appearance that they are linear. malloc() (not counting the virtual memory via the MMU) works the other way around, allocating a number of linear chunks of those fractions to meet the alloc need, with an alloc/free scheme that attempts to keep as many large chunks available as possible, just in case. A bad malloc system is one where half of your memory is free but the largest malloc that works without an out-of-memory error is a small fraction of that memory; say you have a gig free and can only allocate 4096 bytes.
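The "half the memory is free but a larger allocation still fails" situation is easy to reproduce with a toy first-fit allocator. This is only a sketch over a hypothetical ToyHeap made of equal-sized blocks; real malloc() implementations are far more elaborate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy heap: each element is one fixed-size block, true = in use.
class ToyHeap {
public:
    explicit ToyHeap(std::size_t blocks) : used_(blocks, false) {}

    // First fit: returns the index of the first run of n free blocks,
    // or -1 if no contiguous run is large enough.
    long allocate(std::size_t n) {
        if (n == 0) return -1;
        std::size_t run = 0;
        for (std::size_t i = 0; i < used_.size(); ++i) {
            run = used_[i] ? 0 : run + 1;
            if (run == n) {
                const std::size_t start = i + 1 - n;
                for (std::size_t j = start; j <= i; ++j) used_[j] = true;
                return static_cast<long>(start);
            }
        }
        return -1;
    }

    // Marks n blocks starting at start as free again.
    void release(std::size_t start, std::size_t n) {
        for (std::size_t j = start; j < start + n && j < used_.size(); ++j)
            used_[j] = false;
    }

    // Total number of free blocks, contiguous or not.
    std::size_t free_blocks() const {
        return static_cast<std::size_t>(
            std::count(used_.begin(), used_.end(), false));
    }

private:
    std::vector<bool> used_;
};
```

Fill an 8-block heap with single-block allocations, free every second block, and free_blocks() reports 4 while allocate(2) still fails, because no two free blocks are adjacent.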
You should look at virtual memory and TLB (translation lookaside buffer) or paging.
It is not trivial to implement virtual memory and paging. The performance of your whole system depends on it. If it's not done properly your system will thrash.
It is early morning here so Wikipedia will have to do for now: https://en.m.wikipedia.org/wiki/Translation_lookaside_buffer
EDIT:
Those coloured spots you saw in your defrag were chunks on your HDD. Each chunk is of some specified size. Depending on how fragmented your HDD is, you might have portions of your HDD that look like this:
*-*-***-***-*
where * means full, and - means empty
This (above) could be part of one application/file or multiple files; I will assume one file is split across those to simplify my example. At the end of each * there is a pointer to the next location where the next * chunk is (this is called a linked list). The more fragmented your HDD is (or memory) the more of these pointers to next chunk you will have. This in turn uses more space for next pointers instead of using space for data and the result is more overhead when reading that data. If this is a file on disk, you will have multiple seeks (which are bad because they're slow) if your data is not grouped together (locality principle). When you use defrag, it moves and groups all chunks together (as best as it can).
*-*-***-***-*
becomes
*********----
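The before/after picture above can be reproduced in a few lines. This is just a toy over the same '*' and '-' notation, not how a real defragmenter moves sectors:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Moves every full chunk ('*') to the front while keeping the chunks in
// their original relative order, leaving all free chunks ('-') at the end.
inline std::string defragment(std::string disk) {
    std::stable_partition(disk.begin(), disk.end(),
                          [](char c) { return c == '*'; });
    return disk;
}
```

Calling defragment("*-*-***-***-*") yields "*********----", the same layout shown above.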
The OS decides paging and virtual memory addressing (and such). The TLB is a piece of hardware (a cache) that aids this process: it caches mappings from virtual to physical memory addresses for fast lookup. The CPU communicates with the TLB via the MMU.
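As a rough sketch of that split between the OS-managed page table and the TLB cache, here is a toy translator. The class name ToyMmu, the 4 KiB page size and the unbounded TLB are assumptions for illustration; a real TLB is small, lives in hardware and evicts entries.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

constexpr std::uint64_t kPageBits = 12;  // 4 KiB pages (assumed)

class ToyMmu {
public:
    // The "OS" side: install a virtual page -> physical frame mapping.
    void map_page(std::uint64_t vpn, std::uint64_t pfn) { page_table_[vpn] = pfn; }

    // Translates a virtual address. Returns false to model a page fault.
    bool translate(std::uint64_t vaddr, std::uint64_t& paddr) {
        const std::uint64_t vpn = vaddr >> kPageBits;
        const std::uint64_t offset = vaddr & ((std::uint64_t{1} << kPageBits) - 1);
        auto hit = tlb_.find(vpn);
        if (hit == tlb_.end()) {
            ++misses_;
            auto entry = page_table_.find(vpn);  // slow path: consult the table
            if (entry == page_table_.end()) return false;
            hit = tlb_.emplace(vpn, entry->second).first;  // fill the TLB
        }
        paddr = (hit->second << kPageBits) | offset;
        return true;
    }

    unsigned misses() const { return misses_; }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> page_table_;  // set up by the OS
    std::unordered_map<std::uint64_t, std::uint64_t> tlb_;         // cached translations
    unsigned misses_ = 0;
};
```

The second access to the same page is served from tlb_ without touching page_table_, which is the whole point of the cache.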
To answer your questions
You should study operating systems.
Yes, the locations where your files are placed on the HDD are decided by the OS. If you delete a file and download it again, there is no guarantee it will be placed in the same location; most likely it will not.
A nice summary of how all the components and principles I mentioned here work: Click Here. It's a ppt with slides from a Real Time Operating Systems book (if I'm not mistaken, the exact same one I used).

Virtual memory beyond page tables

I am working on a research project to develop an OS for a many-core (1000+) chip. We are looking into implementing a virtual-memory-type system for memory permissions (read/write/execute) that would allow memory to be safely shared across cores.
Basically, we want a system that would allow us to mark a 'page' as being readable by some subset of cores, writable by another, etc. We are not going to be doing address translation (at least at this point), but we need a way to efficiently set and query permissions. It is going to be a software-filled data structure with a simple TLB-style cache.
Our intuition is that simply replicating page tables for each core will be too expensive (in terms of memory usage).
What data structures would be efficient for this kind of problem?
thanks
Have you looked at how common multi-core (2-12 core) CPUs address this problem?
Do you know where, when, why and how the solution used in these common multi-core CPUs will not scale to 1,000+ cores?
In other words, can you quantify what's wrong with the existing solution, which is working and has been working in common CPUs with a core count of 12 or fewer?
If you know that -- then the answer is closer than you think, because it just requires understanding how AMD/Intel solved the problem on a lesser scale -- and what's needed to make their solution work on a greater one (Maybe more memory for tables, algorithm tweaks, etc.)
Look at AMD's/Intel's data structures -- then build a software simulator for 1,000+ cores with those data structures, and see where/when/why and how your simulation fails -- if it fails...
Ideally build your simulator with a user-selectable number of cores, then TEST, TEST, TEST with different amounts of cores -- working your way up, noting bottlenecks along the way.
Your simulator should work EXACTLY as well as AMD's (if you're using AMD data structures) or Intel's (if you're using Intel data structures) at the same core count as one of their chips. That should show that they (AMD/Intel) are doing what they're doing correctly (because they are), and it will help prove that your simulation program is doing its simulation correctly at a specific number of cores.
Wishing you luck!

Points to be considered while designing or coding for lesser footprint deliverables

Please post the points one should keep in mind while designing or coding for lesser footprint deliverables for embedded systems.
I am not giving compiler or platform details, as I want generic information. But, any specific information on Linux based OS is also welcome.
Depends on how low you want to get. I'm currently coding for fiscal printers, and there's no OS, and the main rule is no dynamic memory allocation. The funny thing is that I still convinced the crew to code fully modern C++ ;).
Actually there are a few rules we decided upon:
no dynamic allocation
hence, no STL
no exception handling (obvious reasons)
There isn't a general answer, only ones specific to language/platform ... but
Small memory footprint ...
Don't use Java, C#/mono, PHP, Perl, Python or anything with garbage collection
Get as close to the metal as feasible; use C
Do a lot of profiling to see where memory is getting allocated, if you are using dynamic allocation
Ensure you prevent heap fragmentation by allocating sensible chunks and sizes of the heap
Avoid recursive functions, especially those that use malloc(). It is better to allocate a chunk once and pass a pointer around.
use free() ;)
Ensure your types are no bigger than required
Turn on compiler optimizations
There will be more.
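To illustrate the "types no bigger than required" point from the list above: besides picking fixed-width types, member order matters, because the compiler pads members to their alignment. The two structs below are made-up examples; exact sizes depend on the ABI, so the code only checks the relation between them.

```cpp
#include <cassert>
#include <cstdint>

// Widest member in the middle: padding is typically inserted before it
// and after the trailing byte.
struct Loose {
    std::uint8_t  flags;
    std::uint32_t counter;
    std::uint8_t  state;
};

// Same fields, ordered widest first so the small members pack together.
struct Tight {
    std::uint32_t counter;
    std::uint8_t  flags;
    std::uint8_t  state;
};

static_assert(sizeof(Tight) <= sizeof(Loose),
              "reordering should not cost space here");
```

On common 32-bit and 64-bit ABIs this is typically 12 bytes versus 8 bytes, which adds up quickly in large static arrays.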
For a really low footprint, consider writing assembly directly.
We all know that Hello World in C or C++ is 20 KB+ (because of all the default libraries which get linked). In assembly this overhead is gone. As pointed out in the comments, one can reduce the standard libraries quite a bit. However, the fact remains that the code density you can get when coding assembly is much higher than what a compiler will generate from a higher-level language. So for code where every byte matters, use assembly.
Also, when programming devices with less capable processors, assembly language might be your only way to make the program fast enough to be real-time enough to (for instance) control machines.
When faced with such constraints, it is advisable to pre-allocate memory in order to guarantee that the system will work under load. A design pattern such as "object pooling" can be used to share resources within the system.
The C language enables tight resource (i.e. memory & compute cycles) control. It should be strongly considered.
Avoid recursion as it is easy to abuse and can result in stack overflow conditions.
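The pre-allocation and "object pooling" advice above can be sketched as a fixed-capacity pool whose storage is reserved up front. ObjectPool and Message are hypothetical names for this example; it is a minimal sketch, not thread-safe, and it assumes T is default-constructible.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Fixed-capacity pool: all storage lives inside the object, so acquiring
// a slot never calls malloc/new at run time.
template <typename T, std::size_t N>
class ObjectPool {
public:
    // Returns a free slot, or nullptr when the pool is exhausted.
    T* acquire() {
        for (std::size_t i = 0; i < N; ++i) {
            if (!in_use_[i]) {
                in_use_[i] = true;
                return &slots_[i];
            }
        }
        return nullptr;
    }

    // Returns a slot to the pool; pointers not from this pool are ignored.
    void release(T* p) {
        for (std::size_t i = 0; i < N; ++i) {
            if (p == &slots_[i]) {
                in_use_[i] = false;
                return;
            }
        }
    }

    // Number of slots currently free.
    std::size_t available() const {
        std::size_t n = 0;
        for (bool used : in_use_) n += used ? 0 : 1;
        return n;
    }

private:
    std::array<T, N> slots_{};
    std::array<bool, N> in_use_{};
};

struct Message {
    int  id;
    char text[32];
};
```

Declared as a static or global, for example static ObjectPool<Message, 16> pool;, the worst-case memory use is known at link time, and an exhausted pool shows up as a nullptr you can handle rather than as heap fragmentation under load.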

Why do Cocoa apps use so much memory?

Even the standard blank-window Cocoa app that gets built when you make a new Cocoa project in Xcode uses almost 6 MB of memory. What's the reason for this? Is it possible to make an app use less, or does OS X simply manage memory differently for Cocoa apps?
Not that I'm complaining. I know that performance "hardly matters anymore" (edit: what I mean is, it matters less than readability/maintainability/the programmer's time). I'm just curious.
OS X does a lot of magic with shared memory and copy-on-write pages, so chances are that it doesn't take that much physical RAM for every application.
You can check exactly how memory blocks are mapped by running:
sudo vmmap <PID of the process>
It depends on all the frameworks (APIs) you use. Combine that with the VM allocations done by low-level operations.
It's only worth trying to reduce the total heap allocation, as well as the resident size of the code. Make sure your data structures are allocated efficiently and try compiling with the ever-so-famous "-Os" optimization flag (size optimization). There isn't much you can do about the VM eaten by Cocoa. I wouldn't really worry about it.
This is clearly a 'WTF' moment for developers in general. The question is usually - why does my trivial application use up so much memory.
The answer is down to the underlying framework. You could argue that 6MB is too much, but really, it is nothing.
It's not rare to see computers come with 2 GB of memory these days. The stock iMac has 4 GB. The whole point of the computer industry is to use up all the resources a machine has so that it continues to evolve.
Yes, you should avoid inefficiencies where possible (don't load up a 5-million-point array at startup, for instance). But unless your beta demonstrates you fudged up, just keep it on the list of todos.
I'm a bit out on a limb here, but I guess it's because all the libraries that get added have to do quite a bit of setting up and there is no need to garbage collect, so they simply get to waste memory; plus, even if all memory got autoreleased, it would wait until the first idle event, which is after the creation of the window. Delete unneeded libraries/frameworks, or force a garbage collect somewhere after loading the window from the nib and see how much it goes down if you're so concerned.
I am not concerned about it. Some of the memory might be returned later, and the rest is the price you pay for a powerful framework.
A factor which is not directly linked to Cocoa but applies to frameworks in general is that the overhead is not linear. There is usually a fixed and a variable "price", in terms of overhead, for using the framework.
When you create a simple blank window, the fixed overhead is crushing, but when you create an application with tens of windows, dialogs, controls and all, the initial fixed overhead becomes negligible, against the size of the application itself.