JProfiler: Trying to find memory leak - jprofiler

My application needs about 10 GByte of RAM for specific input, where for regular inputs roughly 1 GByte is enough. Closer analysis with JProfiler reveals that (after GC) quite a lot of memory is used by standard classes from java.util.*:
LinkedHashMap$Entry, HashMap$Entry[], LinkedHashMap, HashMap$KeySet, HashMap$EntrySet, LinkedHashSet, TreeMap$Entry, and TreeMap (in this order), and related classes. The following entry is of a class in my own code, where the number of instances and the amount of used memory seems very reasonable.
In detail with a total heap usage of about 900 MByte I see the following Size entries in the All Objects view:
LinkedHashMap$Entry: 418 MByte
HashMap$Entry[]: 178 MByte
LinkedHashMap: 124 MByte
HashMap$KeySet: 15 MByte
The memory used by LinkedHashMap seems to be too high, even when considering that every LinkedHashSet is backed by a LinkedHashMap.
I recorded object allocation in JProfiler and observe the Allocation Hot Spots for LinkedHashMap. In there, I see entries that I do not understand:
The third entry shows a hot spot (with 6.5% of the allocated memory) named X.<init> where X is a class in my own code. The constructors of this method do not have anything to do with LinkedHashMap. Following this entry to Thread.run at the end shows a slow decline from 6.5% to 5.8% at Thread.run. What could be the problem with my code in X? Why is it shown here?
About 8% of allocated memory is shown in the hot spot named java.util.HashSet.iterator. Following this entry along the path with the highest percentage (first entry: 2.8%) I get several methods inside my code until, finally, java.lang.Thread.run is shown (with 2.8%). What does this mean? As far as I know the Thread.run method does not create instances of LinkedHashMap. What is the connection to the iterator method?
In general, how do I find the code that keeps references to (a lot of) LinkedHashMap objects? Using the Heap Walker I can only see lots of those instances, but do not see any pattern (even when observing the paths to the GC roots). In my experiments all instances seem to be in order.
Possibly important things:
My application constructs a result (for further processing) and, for this construction, the high memory usage is observed. The construction continually creates objects, so waiting for a stable point and then observing every single created LinkedHashMap object is not possible.
I have good computers available to debug (up to 48 cores and 192 GByte of RAM, possibly even more).
java version "1.7.0_13" (Java(TM) SE Runtime Environment (build 1.7.0_13-b20),
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode))
JProfiler 7.2.2 (Build 7157), licensed

In general, how do I find the code that keeps references to (a lot of) LinkedHashMap objects?
In the heap walker, select "LinkedHashMap" and create a new object set. Then switch to the "References" view and show the "Cumulated incoming references". There, you can analyze references for the entire object set.
As for your question about the allocation hot spots and why the Thread.run method is shown: These are back traces, they show how the hot spot has been invoked and all numbers on the nodes are contributions to the hot spot at the top. The deepest node will already always be an entry point, typically a Thread.run method or the main method.

Related

Are JVM heap/stack different than virtual address space heap/stack?

Memory is divided into "segments" called heap, stack, bss, data, and text. However, the JVM also has these concepts of stack and heap. So how are these two reconciled?
Are they different levels of abstraction, where main memory is one or two levels below the JVM, and whose "segments" maps naturally to JVM's "segments"? Since JVM is supposed to be a virtual computer, it seems to me like they emulate what happens underneath but at a higher level of abstraction.
Sounds to me like you've been reading a textbook or similar. All these terms generally have very precise definitions in books/lectures, but a lot less precise definitions in reality. Therefore what people mean when they say heap is not necessarily exactly the same as what a book etc. says.
Memory is divided into "segments" called heap, stack, bss, data, and text.
This is only true for a typical user space process. In other word this will be true for an everyday program written in c or similar, however it is not true for all programs, and definitely not true for the entire memory space.
When a program is executed the OS allocates memory for the various segments listed, except the the heap. The program can request memory from the OS while it is executing. This allows a program to use a different amount of memory depending on its needs. The heap refers to memory requested by the program usually via a function like malloc. To clarify the heap typically refers to a managed region of memory, usually managed with malloc/free. It is also possible to request memory directly from the OS, in an unmanaged fashion. Most people (Imo) would say this wouldn't count as part of the heap.
The stack is a data structure/segment which keeps track of local variables and function calls. It stores important information like where to return after a function call. In c or other "native" languages the stack is created by the OS and can grow or shrink if needed.
Java allows program to request memory during execution using new. Memory allocated to a java program using new is referred to as memory in the java heap. One could imagine that if you where implementing a Jvm you would use malloc behind the scenes of new. This would result in a java heap within a regular native heap. In reality "serious" jvms do not do this and interact directly with the OS for memory.
In Java the stack is created by the Jvm. One could imagine that this is allocated by malloc, but as with the heap this is likely not how real world jvms do it.
Edit:
A Jvm like hotspot. Would likely allocate memory directly from the OS. This memory would then get put into some kind of pool, from which it would be removed as needed. Reasons for needed memory would be needed includes new, or a stack that needs to grow.

Java 8: Why does Metaspace size increase but number of loaded classes stay the same?

In our app running on Jdk 8 we use VisualVM to track the usage of loaded classes and the usage of the metaspace.
At some point in time while our app is running we see that the number of loaded classes don't increase any more but the metaspace still increases in it's size while our program is running. So what else apart from classes is stored in metaspace, that could cause that?
While your program is running, some parts of your code may be determined as "hot" by HotSpot's JIT compiler. This will cause those parts to be transformed/compiled to native code, and also some other code may be inlined into it. This native code representation has to go somewhere, and it goes into the same place as other class metadata - the Metaspace.
It explains continuous growth you're seeing: hot parts are determined over time using a simple metric of how much times did that piece of code got executed. Over time more and more code pieces will be JIT'ed as they'll hit threshold set by -XX:CompileThreshold (defaults to 10000)
i am not sure but hier (http://java.dzone.com/articles/java-8-permgen-metaspace) i fund this
Garbage collection of the dead classes and classloaders is triggered once the class metadata usage reaches the “MaxMetaspaceSize”.
maybe is this the cause for increasing metaspace size.

Estimating available RAM left with safety margin in C (STM32F4)

I am currently developing application for STM32F407 using STM32CubeMx and Keil uVision. I know that dynamic memory allocation in embedded systems is mostly discouraged, but from spot to spot on internet I can find some arguments in favor of it.
Due to my inventors soul I wanted to try to do it, but do it safely. Let's assume I'm creating a dynamically allocated fifo for incoming UART messages, holding structs composed of the msg itself and its' length. However I wouldn't like to consume all the heap size doing so, therefore I want to check how much of it I have left: Me new (?) idea is to try temporarily allocating some big chunk of memory (say 100 char) - if it's successful, I accept the incoming msg, if not - it means that I'm running out of heap and ignore the msg (or accept it and dequeue the oldest). After checking I of course free the temp memory.
A few questions arise in my mind:
First of all, does it make sens at all? Do you think, basic on your experience, that it could be usefull and safe?
I couldn't find precise info about what exactly shares RAM in ES (I know about heap, stack and volatile vars) so my question is: providing that answer to 1. isn't "hell no go home", what size of the temp memory checker would you pick for the mentioned controller?
About the micro itself - it has 192kB RAM, however in the Drivers\CMSIS\Device\ST\STM32F4xx\Source\Templates\arm\startup_stm32f407xx.s file only 512B+1024B are allocated for heap and stack - isn't that very little, leaving the whooping, remaining 190kB for volatile vars? Would augmenting the heap size to, say 50kB be sensible? If yes, do I do it directly in this file or it's a better practice to do it somewhere else?
Probably for some of you "safe dynamic memory" and "embedded" in one post is both schocking and dazzling, but keep in mind that this is experimenting and exploring new horizons :) Thanks and greetings.
Keil uVision describes only the IDE. If you are using KEil MDK-ARM which implies ARM's RealView compiler then you can get accurate heap information using the __heapstats() function.
__heapstats() is a little strange in that rather than simply returning a value it outputs heap information to a formatted output stream facilitated by a function pointer and file descriptor passed to it. The output function must have an fprintf() like interface. You can use fprintf() of course, but that requires that you have correctly retargetted the stdio
For example the following:
typedef int (*__heapprt)(void *, char const *, ...);
__heapstats( (__heapprt)fprintf, stdout ) ;
outputs for example:
4180 bytes in 1 free blocks (avge size 4180)
1 blocks 2^11+1 to 2^12
Unfortunately that does not really achieve what you need since it outputs text. You could however implement your own function to capture the data in memory and parse the result. You may only need to capture the first decimal digit characters and discard anything else, except that the amount of free memory and the largest allocatable block are not necessarily the same thing of course. Fragmentation is indicated by the number or free blocks and their average size. You can perhaps guarantee to be able to allocate at least an average sized block.
The issue with dynamic allocation in embedded systems are to do with handling memory exhaustion and, in real-time systems, the non-deterministic timing of both allocation and deallocation using the default malloc/free implementations. In your case you might be better off using a fixed-block allocator. You can implement such an allocator by creating a static array of memory blocks (or by dynamically allocating them from the heap at start-up), and placing a pointer to each block on a queue or linked list or stack structure. To allocate you simply remove a pointer from the queue/list/stack, and to free you place a pointer back. When the available blocks structure is empty, memory is exhausted. It is entirely deterministic, and because it is your implementation can be easily monitored for performance and capacity.
With respect to question 3. You are expected to adjust the heap and system stack size to suit your application. Most tools I have used have a linker script that automatically allocates all available memory not statically allocated, allocated to a stack or reserved for other purposes to the heap. However MDK-ARM does not do that in the default linker scripts but rather allocates a fixed size heap.
You can use the linker map file summary to determine how much space is unused and manually expand the heap. I usually do that leaving a small amount of unused space to account for maintenance when the amount of statically allocated data may increase. At some point however; you end up running out of memory, and the arcane error messages from the linker may not make it obvious that your heap is just too big. It is possible to override the default linker script and provide your own, and no doubt possible then to automatically size the heap - though I have never taken the trouble to try it.
Okay I have tested my idea with dynamic heap free space checking and it worked well (although I didn't perform long-run tests), however #Clifford answer and this article convinced me to abandon the idea of dynamic allocation. Eventually I implemented my own, static heap with pages (2d array), occupied pages indicator (0-1 array of size of number of pages) and fifo of structs consisting of pointer to the msg on my static heap (actually just the index of the array) and length of message (to determine how many contiguous pages it occupies). 95% of msg I receive should take up only one page, 5% - 2 or 3 pages, so fragmentation is still possible, but at least I keep a tight rein on it and it affects only the part of memory assigned to this module of the code (in other words: the fragmentation doesn't leak to other parts of the code). So far it has worked without any problems and for sure is faster because the lookup time is O(n*m), n - number of pages, m - the longest page possible, but taking into consideration the laws of probability it goes down to O(n). Moreover n is always a lot smaller the number of all allocation units in memory, so way less to look for.

How can I change maximum available heap size for a task in FreeRTOS?

I'm creating a list of elements inside a task in the following way:
l = (dllist*)pvPortMalloc(sizeof(dllist));
dllist is 32 byte big.
My embedded system has 60kB SRAM so I expected my 200 element list can be handled easily by the system. I found out that after allocating space for 8 elements the system is crashing on the 9th malloc function call (256byte+).
If possible, where can I change the heap size inside freeRTOS?
Can I somehow request the current status of heap size?
I couldn't find this information in the documentation so I hope somebody can provide some insight in this matter.
Thanks in advance!
(Yes - FreeRTOS pvPortMalloc() returns void*.)
If you have 60K of SRAM, and configTOTAL_HEAP_SIZE is large, then it is unlikely you are going to run out of heap after allocating 256 bytes unless you had hardly any heap remaining before hand. Many FreeRTOS demos will just keep creating objects until all the heap is used, so if your application is based on one of those, then you would be low on heap before your code executed. You may have also done something like use up loads of heap space by creating tasks with huge stacks.
heap_4 and heap_5 will combine adjacent blocks, which will minimise fragmentation as far as practical, but I don't think that will be your problem - especially as you don't mention freeing anything anywhere.
Unless you are using heap_3.c (which just makes the standard C library malloc and free thread safe) you can call xPortGetFreeHeapSize() to see how much free heap you have. You may also have xPortGetMinimumEverFreeHeapSize() available to query how close you have ever come to running out of heap. More information: http://www.freertos.org/a00111.html
You could also define a malloc() failed hook (http://www.freertos.org/a00016.html) to get instant notification of pvPortMalloc() returning NULL.
For the standard allocators you will find a config option in FreeRTOSConfig.h .
However:
It is very well possible you run out of memory already, depending on the allocator used. IIRC there is one that does not free() any blocks (free() is just a dummy). So any block returned will be lost. This is still useful if you only allocate memory e.g. at startup, but then work with what you've got.
Other allocators might just not merge adjacent blocks once returned, increasing fragmentation much faster than a full-grown allocator.
Also, you might loose memory to fragmentation. Depending on your alloc/free pattern, you quickly might end up with a heap looking like swiss cheese: Many holes between allocated blocks. So while there is still enough free memory, no single block is big enough for the size required.
If you only allocate blocks that size there, you might be better of using your own allocator or a pool (blocks of fixed size). Thaqt would be statically allocated (e.g. array) and chained as a linked list during startup. Alloc/free would then just be push/pop on a stack (or put/get on a queue). That would also be very fast and have complexity O(1) (interrupt-safe if properly written).
Note that normal malloc()/free() are not interrupt-safe.
Finally: Do not cast void *. (Well, that's actually what standard malloc() returns and I expect that FreeRTOS-variant does the same).

boost::serialization high memory consumption during serialization

just as the topic suggests I've come across a slight issue with boost::serialization when serializing a huge amount of data to a file. The problem consists of the memory footprint of the serialization part of the application taking around 3 to 3.5 times the memory of my objects being serialized.
It is important to note that the data structure I have is a three dimensional vector of base class pointers and a pointer to that structure. Like this:
using namespace std;
vector<vector<vector<MyBase*> > >* data;
This is later serialised with a code analog to this one:
ar & BOOST_SERIALIZATION_NVP(data);
boost/serialization/vector.hpp is included.
Classes being serialised all inherit from "MyBase".
Now, since the start of my project I've used different archives for serialization from typical binary_archive, text, xml and finally polymorphic binary/xml/text. Every single one of these acts exactly the same way.
Typically this wouldn't be a problem if I had to serialize small amounts of data but the number of classes I have are in the milions (ideally around 10 milion) and the memory usage as I've been able to test it shows consistently that the memory allocated by boost::serialization part of the code is around 2/3 of the application whole memory footprint while writing the file.
This amounts to around 13.5 GB of RAM taken for 4 milion objects where the objects themselves take 4.2GB. Now this is as far as I've been able to take my code since I don't have access to a machine with more than 8GB of physical RAM. I should also note that this is a 64bit application being run on a Windows 7 professional x64 edition but the situation is similar on an Ubuntu box.
Anyone has any idea how I would go about troubleshooting this as it is unacceptable for me to have such high memory requirements for an application that will not use as much memory while running as it does while serializing.
Deserialization isn't as bad, as it allocates around 1.5 times the needed memory. This is something I could live with.
Tried turning tracking off with boost::archive::archive_flags::no_tracking but it acts exactly the same.
Anyone have any idea what I should do?
Using valgrind I found that the main reason of memory consumption is a map inside the library to track pointers. If you are certain that you do not need pointer tracking ( it means you are sure that there is no pointer aliasing) disable tracking. You can find here the main concepts of disable tracking. In short you must do something like this:
BOOST_CLASS_TRACKING(vector<vector<vector<MyBase*> > >, boost::serialization::track_never)
In my question I wrote a version of this macro that you could disable tracking of a template class. This must have a significant impact on your memory consumption.
Also notice that there are pointers inside any containers If you want tracking never you must disable tracking of them too. Currently I could not find any way to do this properly.