I am running this code on AIX 6.1
while (true)
{
    int a = rand();              // generate a random integer value
    void* test = malloc(a*a);    // allocate a large chunk of memory
    usleep(3000000);             // sleep for 3 seconds
    free(test);                  // release the memory block
}
with MALLOCTYPE=buckets set.
My observations are:
The resident set size (real memory) and the data section size of the process are continuously increasing. This was checked with the command ps v PID.
The pg sp value shown in topas for the process is slowly increasing.
Can someone explain this behavior?
On free(), memory is not released back to the AIX OS; it is kept by the allocator for reuse. With MALLOCOPTIONS=disclaim, free() releases memory back to the AIX OS and there is no increase in memory utilization. But with MALLOCOPTIONS=disclaim, CPU utilization is almost 2-3 times higher.
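The reuse can be made visible with a minimal sketch (plain malloc/free, nothing AIX-specific assumed): after free(), an allocation of the same size is usually satisfied from the block that was just released, so the process footprint does not shrink in between.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *first = malloc(64 * 1024);
    printf("first  allocation: %p\n", first);
    free(first);                                /* block goes back on the allocator's free list */

    void *second = malloc(64 * 1024);
    printf("second allocation: %p\n", second);  /* typically the same address again */
    free(second);
    return 0;
}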
Related
I am using a Raspberry Pi 3 and basically stepped over a little tripwire.
I have a very big and complicated program that uses a lot of memory and puts a heavy load on the CPU. I assumed it was normal that if I started a second instance while the first one was still running, it would use the same amount of memory again and, in particular, double the CPU load. I found out that it neither uses more memory nor increases the CPU load.
To find out whether this behavior came from my program, I wrote a tiny C++ program with very high memory usage; here it is:
#include <iostream>
using namespace std;

int main()
{
    for (int i = 0; i < 100; i++) {
        float a[100][100][100];
        for (int i2 = 0; i2 < 99; ++i2) {
            for (int i3 = 0; i3 < 99; ++i3) {
                for (int i4 = 0; i4 < 99; ++i4) {
                    a[i2][i3][i4] = i2 * i3 * i4;
                    cout << a[i2][i3][i4] << endl;
                }
            }
        }
    }
    return 0;
}
The CPU load is at about 30% of the maximum when I start the program in one terminal. Strangely, when I start it in another terminal at the same time, the CPU load is not affected. I concluded that this behaviour couldn't be specific to my program.
Now I want to know:
Is there a "lock" that ensures that a certain type of process does not grill your cores?
Why don't two identical processes double the CPU load?
Well, I found out that there is a "lock" of sorts that keeps a single process from taking all the memory and driving the CPU load up to 100%. It seems that the more processes there are, the higher the CPU load gets, but not in a linear way.
Additionally, the code I wrote to look for this behaviour only has high memory usage; the 30% level came from the cout in the standard library. Multiple processes can print at the same time without increasing the CPU load, but it affects the speed of the printing.
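To separate the two effects, here is a variant of the test program with the printing removed; this is only a sketch (not the original code), but it keeps the same memory access pattern while staying purely CPU-bound:

#include <cstdio>

int main()
{
    static float a[100][100][100];   // static keeps the ~4 MB array off the stack
    volatile float sink = 0.0f;      // observable result, so the work is not trivially optimised away

    for (int i = 0; i < 100; ++i) {
        for (int i2 = 0; i2 < 100; ++i2)
            for (int i3 = 0; i3 < 100; ++i3)
                for (int i4 = 0; i4 < 100; ++i4)
                    a[i2][i3][i4] = static_cast<float>(i2 * i3 * i4);
        sink = sink + a[99][99][99];
    }

    std::printf("%f\n", static_cast<double>(sink));
    return 0;
}

With two copies of this running, each one should keep roughly a full core busy on its own, unlike the I/O-bound version above.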
When I found that out, I got suspicious about the program's speed. I used the analytics in my C++ IDE to find out the duration of my original program, and indeed it was a bit more than two times slower.
That seems to be the solution I was looking for, but I think this is not really applicable to a large audience since the structure of the Raspberry Pi is very particular. I don't know how this works for other systems.
BTW: I could have guessed that there is a lock. I mean, if you start 10 processes that each take 15% of the maximum CPU load, you would have 150% CPU usage. IMPOSSIBLE!
Here's a low-level question: how CPU-intensive is getting the system time?
What is the source of the time? I know there is a hardware clock on the BIOS chip, but I'm thinking that getting data from outside the CPU and RAM needs some hardware synchronization, which may delay the read, so I'm guessing the CPU may have its own clock. Feel free to correct me if I'm wrong in any way.
Does getting the time incur a heavy system call, or does it depend in any way on the programming language used?
I have just tested it using a C++ program:
#include <ctime>   // clock(), clock_t, CLOCKS_PER_SEC

clock_t started = clock();
clock_t endClock = started + CLOCKS_PER_SEC;   // run for roughly one second
long itera = 0;

for (; clock() < endClock; itera++)
{
}
I get about 23 million iterations per second (Windows 7, 32-bit, Visual Studio 2015, 2.6 GHz CPU). In terms of your question, I would not call this intensive.
In debug mode, I measured 18 million iterations per second.
In case the time is transformed into a localized timestamp, complicated calendar calculations (time zone, daylight saving time, ...) might slow the loop down significantly.
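As a rough illustration of that extra cost (a sketch, not part of the measurement above): fetching the raw time is little more than reading a counter, while producing a local calendar date additionally applies the time zone and DST rules.

#include <cstdio>
#include <ctime>

int main()
{
    std::time_t now = std::time(nullptr);      // cheap: raw seconds since the epoch
    std::tm local = *std::localtime(&now);     // dearer: time zone / DST / calendar arithmetic

    std::printf("%04d-%02d-%02d %02d:%02d:%02d\n",
                local.tm_year + 1900, local.tm_mon + 1, local.tm_mday,
                local.tm_hour, local.tm_min, local.tm_sec);
    return 0;
}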
It is not easy to tell what happens inside the clock() call. On my system it calls QueryPerformanceCounter, but this in turn relies on other system functions, as explained here.
Tuning
To reduce the time-measurement overhead even further, you can measure only every 10th, 100th, ... iteration.
The following measures once every 1024 iterations:
for (; (itera & 0x03FF) || (clock() < endClock); itera++)
{
}
This brings the loops-per-second count up to some 500 million.
Tuning with Timer Thread
The following yields a further improvement of some 10%, at the cost of additional complexity:
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> processing{true};   // brace-init: std::atomic is not copyable

// launch a timer thread to clear the processing flag after 1 s
std::thread t([&processing]() {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    processing = false;
});

for (; (itera & 0x03FF) || processing; itera++)
{
}
t.join();
An extra thread is started which sleeps for one second and then clears a control flag. The main thread executes the loop until the timer thread signals the end of processing.
Is there any way to get information from a dump file about how many garbage collections have been performed for the different generations? When I try to run some PSSCOR4 commands, I get the following.
0:003> !GCUsage
The garbage collector data structures are not in a valid state for traversal.
It is either in the "plan phase," where objects are being moved around, or
we are at the initialization or shutdown of the gc heap. Commands related to
displaying, finding or traversing objects as well as gc heap segments may not
work properly. !dumpheap and !verifyheap may incorrectly complain of heap
consistency errors.
Error: Requesting GC Heap data
0:003> !CLRUsage
The garbage collector data structures are not in a valid state for traversal.
It is either in the "plan phase," where objects are being moved around, or
we are at the initialization or shutdown of the gc heap. Commands related to
displaying, finding or traversing objects as well as gc heap segments may not
work properly. !dumpheap and !verifyheap may incorrectly complain of heap
consistency errors.
Error: Requesting GC Heap data
I can get output from !EEHeap, though, but it does not give me what I am looking for.
0:003> !EEHeap -gc
Number of GC Heaps: 1
generation 0 starts at 0x0000000002c81030
generation 1 starts at 0x0000000002c81018
generation 2 starts at 0x0000000002c81000
ephemeral segment allocation context: none
segment begin allocated size
0000000002c80000 0000000002c81000 0000000002c87fe8 0x6fe8(28648)
Large object heap starts at 0x0000000012c81000
segment begin allocated size
0000000012c80000 0000000012c81000 0000000012c9e358 0x1d358(119640)
Total Size: Size: 0x24340 (148288) bytes.
------------------------------
GC Heap Size: Size: 0x24340 (148288) bytes.
Dumps
You can see the number of garbage collections in Performance Monitor. However, the way performance counters work makes me believe that this information is not available in a dump file and probably not even available during live debugging.
Think of Debug.WriteLine(): once the text has been written to the debug output, it is gone. If you didn't have DebugView running at the time, the information is lost. And that's good, otherwise it would look like a memory leak.
Performance counters (as I understand them) work in a similar fashion. Various "pings" are sent out for someone else (the performance monitor) to record. If no one does, the ping with all its information is gone.
Live debugging
As already mentioned, you can try performance monitor. If you prefer WinDbg, you can use sxe clrn to see garbage collections happen.
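For example, only the commands are shown here (sxe clrn is the CLR notification event mentioned above; g resumes execution):
0:003> sxe clrn
0:003> g
WinDbg then breaks on each CLR notification, so you can at least watch and count the collections as they occur.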
PSSCOR
The commands you mentioned do not show information about the garbage collection count:
0:016> !gcusage
Number of GC Heaps: 1
------------------------------
GC Heap Size 0x36d498(3,593,368)
Total Commit Size 0000000000384000 (3 MB)
Total Reserved Size 0000000017c7c000 (380 MB)
0:016> !clrusage
Number of GC Heaps: 1
------------------------------
GC Heap Size 0x36d498(3,593,368)
Total Commit Size 0000000000384000 (3 MB)
Total Reserved Size 0000000017c7c000 (380 MB)
Note: I'm using PSSCOR2 here, since I have the same .NET 4.5 issue on this machine. But I expect the output of PSSCOR4 to be similar.
A single-threaded program running a loop of 500 million iterations usually takes about 400 ms, and occasionally about 1200 ms, on a 2.8 GHz Linux computer. This is observed in a program that does nothing else but measure the elapsed time of the loop. What accounts for this variability?
import java.io.PrintWriter;
import java.util.Date;

public class Tester
{
    public static void main(String[] args)
    {
        PrintWriter pw = new PrintWriter(System.out, true);

        Date d0 = new Date();
        long t0 = d0.getTime();

        for (long i = 0; i < 500000000; i++)
            ;

        Date d1 = new Date();
        long t1 = d1.getTime();

        pw.format("%d\n", t1 - t0);
    }
}
The garbage collection routines need some time to run, even if you don't have garbage to collect.
It is easy to assume that, since you don't allocate memory or drop references, you will not need to collect garbage; however, the garbage collection routine runs in an independent thread and does not inspect the possible flows through your code to determine whether it should run.
So the garbage collection routine launches and finds no garbage to collect. That takes some time.
In addition, your program (that is, the entire JVM interpreting your classes) may have been swapped off the CPU by the operating system's need to immediately handle some interrupt. This can even happen in multi-core systems, depending on the CPU's core selection algorithms. Examples of items that the CPU must handle immediately include copying Ethernet memory buffers into system memory (to prevent dropped packets), capturing keyboard input, etc.
In short, if you want your analysis to mean something, you have to do real statistics on your benchmarking, as all sorts of external factors can impact your running program's time.
We are creating a Real-Time Process in VxWorks 6.x, and we would like to limit the amount of memory which can be allocated to the heap. How do we do this?
When creating an RTP via rtpSpawn(), you can specify environment variables that control how the heap behaves.
There are three environment variables:
HEAP_INITIAL_SIZE - how much heap to allocate initially (defaults to 64 KB)
HEAP_MAX_SIZE - maximum heap to allocate (defaults to no limit)
HEAP_INCR_SIZE - memory increment when growing the RTP heap (defaults to one virtual page)
The following code shows how to use the environment variables:
char * envp[] = { "HEAP_INITIAL_SIZE=0x20000", "HEAP_MAX_SIZE=0x100000", NULL };
rtpSpawn ("myrtp.vxe", NULL, envp, 100, 0x10000, 0, 0);
This can be done through the HEAP_MAX_SIZE environment variable. If it is set, it limits the heap's ability to grow beyond that size. It does not, however, limit the initial heap size.
See page 31
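To illustrate what the cap means inside the RTP (a sketch assuming the process was spawned with HEAP_MAX_SIZE set as shown above): once the heap cannot grow any further, malloc() simply starts returning NULL, and the application has to handle that.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned long total = 0;

    /* keep grabbing 64 KB blocks until the HEAP_MAX_SIZE cap is hit;
       the blocks are deliberately leaked because we only want to find the limit */
    while (malloc(64 * 1024) != NULL)
        total += 64 * 1024;

    printf("malloc() failed after about %lu bytes\n", total);
    return 0;
}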