jstat for G1 Garbage Collector - jvm

I am trying to analyze the memory usage pattern of a Java process with the G1 garbage collector using jstat:
jstat -gc <Process_ID> 60s
The output looks like the following:
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
0.0 229376.0 0.0 229376.0 1998848.0 1253376.0 16646144.0 301183.5 50176.0 40977.8 8704.0 5303.9 10 0.296 0 0.000 0.296
As I understand it, jstat provides information about young-generation GC as well as Full GC, but it doesn't distinguish between minor and mixed collections. Considering that in a well-tuned G1 collector a Full GC is not expected, and mixed collections mostly take care of the tenured generation, I want to get information about the different types of young collections.
Is there any specific option for jstat which I should use?
I have noticed this discussion on the OpenJDK forum, but I am not sure whether such a feature is available at this point in time.
Please note that I am aware GC logs can help me here, but I am specifically interested in jstat (considering it is lightweight and can be used in production on an as-needed basis).

You can see this blog post, https://blogs.oracle.com/poonam/entry/understanding_g1_gc_logs, which has more detailed information about understanding the G1 GC logs.
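If you do end up falling back to the GC logs, here is a sketch of how they are commonly enabled (app.jar is just a placeholder for your application; the -Xlog form is the JDK 9+ unified logging syntax, the PrintGCDetails form is for JDK 8):
JDK 9 and later: java -Xlog:gc* -jar app.jar
JDK 8 and earlier: java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -jar app.jar
The log lines label each pause as young, mixed, or Full, which is exactly the distinction jstat's counters do not expose.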

At least as of JDK 11, there are three different time counters for -gc:
YGCT: Young Garbage Collection Time, spent on young-only garbage collection.
FGCT: Full Garbage Collection Time, spent on the fallback full stop-the-world GC.
GCT: Total time spent on garbage collections of all types.
So young-only collections are accounted to YGCT and GCT, mixed collections go to GCT only (I'm not sure whether their young portion is accounted to YGCT, but I suspect so based on the times), and full/fallback collections go to FGCT and GCT. Therefore, GCT minus FGCT minus YGCT equals the time spent collecting the old gen within mixed collections.
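As a minimal sketch of that arithmetic (this is not an official jstat option; the values below are simply copied from the sample jstat -gc line in the question, and attributing the remainder to mixed collections relies on the accounting assumption above):
public class JstatMixedTime {
    public static void main(String[] args) {
        // Values copied from the question's sample `jstat -gc` output.
        double ygct = 0.296;  // YGCT: time spent in young-only pauses
        double fgct = 0.000;  // FGCT: time spent in full (fallback) collections
        double gct  = 0.296;  // GCT: total time spent in GC

        // Whatever is left over is attributed to the old-gen portion of mixed collections.
        double mixedOldGen = gct - fgct - ygct;
        System.out.printf("Old-gen time within mixed collections: %.3f s%n", mixedOldGen);
    }
}
In the sample output the difference is 0.000 s, which is consistent with no mixed-collection old-gen work having been recorded yet (FGC is 0 and GCT equals YGCT).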

Related

Finding the Stokes Number of a Microcarrier Particle

I'm trying to model the flow and suspension of microcarriers (particles that are used as surfaces for cells to attach to and grow on) in a CFD application. I know some basic characteristics of the particles (they're called "Cytodex", about 180 µm in diameter, density 1.03 g/cm^3), but I'd like to find the Stokes number to determine how strongly they are affected by turbulence and movement of the fluid. Can somebody point me to how to approach this (or at least approximate it)? It's surprisingly hard to find any information for somebody like me who hasn't got a very strong background in fluid mechanics.
Here is the manufacturer's microcarrier manual. See page 62, Table 12 for Cytodex 1 physical properties.
https://www.gelifesciences.co.kr/wp-content/uploads/2016/07/023.8_Microcarrier-Cell-Culture.pdf
See this SlideShare, slide 15, for how to calculate the Stokes number for Cytodex 1 microcarriers: https://www.slideshare.net/rjrishabhjain/bs-4sedimentation?from_action=save
For Cytodex 1, though, correct the inputs to d = 180 µm, cell-culture media (nutrient broth) viscosity = 0.96 cP, media density ≈ 1.007 g/mL, and microcarrier density 1.03 g/mL, to get a settling velocity of 0.062 cm/s = 3.72 cm/min. However, per the manufacturer's manual the settling velocity is 12-16 mL/min, which might be an error. I am still seeking an answer.
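For reference, these are the standard Stokes-regime formulas behind that calculation (a sketch using the usual symbols: d = particle diameter, ρ_p = particle density, ρ_f = fluid density, μ = dynamic viscosity, g = gravitational acceleration, and U, L = a characteristic fluid velocity and length scale, e.g. impeller tip speed and impeller diameter):
particle relaxation time: τ_p = ρ_p d^2 / (18 μ)
Stokes number: Stk = τ_p U / L
terminal settling velocity: v_s = (ρ_p - ρ_f) g d^2 / (18 μ)
A Stokes number much smaller than 1 means the microcarriers follow the fluid closely; values near or above 1 mean they decouple from the flow.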
For CFD modeling of microcarriers in bioreactors, see Loubière:
https://pdfs.semanticscholar.org/d955/75b5c640c8268fd1ec51b2ce46862e7bbfbd.pdf
I have more related literature if you have interest.
DApple

Number of GCs performed for various generations from a dump file

Is there any way to get information about how many garbage collections have been performed for different generations from a dump file? When I try to run some psscor4 commands, I get the following.
0:003> !GCUsage
The garbage collector data structures are not in a valid state for traversal.
It is either in the "plan phase," where objects are being moved around, or
we are at the initialization or shutdown of the gc heap. Commands related to
displaying, finding or traversing objects as well as gc heap segments may not
work properly. !dumpheap and !verifyheap may incorrectly complain of heap
consistency errors.
Error: Requesting GC Heap data
0:003> !CLRUsage
The garbage collector data structures are not in a valid state for traversal.
It is either in the "plan phase," where objects are being moved around, or
we are at the initialization or shutdown of the gc heap. Commands related to
displaying, finding or traversing objects as well as gc heap segments may not
work properly. !dumpheap and !verifyheap may incorrectly complain of heap
consistency errors.
Error: Requesting GC Heap data
I can get output from !EEHeap, but it does not give me what I am looking for.
0:003> !EEHeap -gc
Number of GC Heaps: 1
generation 0 starts at 0x0000000002c81030
generation 1 starts at 0x0000000002c81018
generation 2 starts at 0x0000000002c81000
ephemeral segment allocation context: none
segment begin allocated size
0000000002c80000 0000000002c81000 0000000002c87fe8 0x6fe8(28648)
Large object heap starts at 0x0000000012c81000
segment begin allocated size
0000000012c80000 0000000012c81000 0000000012c9e358 0x1d358(119640)
Total Size: Size: 0x24340 (148288) bytes.
------------------------------
GC Heap Size: Size: 0x24340 (148288) bytes.
Dumps
You can see the number of garbage collections in Performance Monitor. However, the way performance counters work makes me believe that this information is not available in a dump file, and probably not even available during live debugging.
Think of Debug.WriteLine(): once the text has been written to the debug output, it is gone. If you didn't have DebugView running at the time, the information is lost. And that's good, because otherwise it would look like a memory leak.
Performance counters (as I understand them) work in a similar fashion. Various "pings" are sent out for someone else (the performance monitor) to record. If no one does, the ping with all its information is gone.
Live debugging
As already mentioned, you can try performance monitor. If you prefer WinDbg, you can use sxe clrn to see garbage collections happen.
PSSCOR
The commands you mentioned do not show information about the garbage collection count:
0:016> !gcusage
Number of GC Heaps: 1
------------------------------
GC Heap Size 0x36d498(3,593,368)
Total Commit Size 0000000000384000 (3 MB)
Total Reserved Size 0000000017c7c000 (380 MB)
0:016> !clrusage
Number of GC Heaps: 1
------------------------------
GC Heap Size 0x36d498(3,593,368)
Total Commit Size 0000000000384000 (3 MB)
Total Reserved Size 0000000017c7c000 (380 MB)
Note: I'm using PSSCOR2 here, since I have the same .NET 4.5 issue on this machine. But I expect the output of PSSCOR4 to be similar.

NEON VLD consuming more cycles than expected?

I have some simple asm code which loads 12 NEON quad registers and pairs a pairwise-add instruction with each load instruction (to exploit the dual-issue capability). I have verified the code here:
http://pulsar.webshaker.net/ccc/sample-d3a7fe78
As one can see, the code should take around 13 cycles. But when I run the code on the board, the load instructions seem to take more than one cycle per load. I verified and found that VPADAL takes 1 cycle as stated, but VLD1 takes more than one cycle. Why is that?
I have taken care of the following:
The address is 16 byte aligned.
Provided the alignment hint in the instruction: vld1.64 {d0, d1}, [r0,:128]!
Tried the preload instruction pld [r0, #192] in various places, but that seems to add to the cycle count instead of actually reducing the latency.
Can someone tell me what I am doing wrong and where this latency comes from?
Other Details:
With reference to cortex-a8
arm-2009q1 cross compiler tool chain
coding in assembly
Your code is executing much slower than expected because, as currently written, it causes a perfect storm of pipeline stalls. On any modern pipelined CPU, an instruction can execute in one cycle under ideal conditions: it is not waiting on memory and has no register dependencies. The way you've written the code, you don't allow for the delay in reading from memory, and you make the next instruction dependent on the result of the read. This causes the worst possible performance. Also, I'm not sure why you're accumulating the pairwise adds into multiple registers. Try something like this:
veor.u16 q12,q12,q12          @ clear the accumulator
top_of_loop:
vld1.u16 {q0,q1},[r0,:128]!   @ issue the loads first...
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0             @ ...then accumulate, once the data has had time to arrive
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
vld1.u16 {q0,q1},[r0,:128]!   @ second batch of loads
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
subs r1,r1,#8                 @ decrement the loop counter
bne top_of_loop
Experiment with different numbers of load instructions before executing the adds. The point is that you need to allow time for the read to occur before you can use the target register.
Note: Using Q4-Q7 is risky because they're non-volatile registers. On Android you will get random garbage appearing in these (especially Q4).

Where does the limitation of 10^15 in D.J. Bernstein's 'primegen' program come from?

At http://cr.yp.to/primegen.html you can find the sources of a program that uses Atkin's sieve to generate primes. As the author says it may take a few months for him to answer an e-mail (I understand that, he surely is a busy man!), I'm posting this question here.
The page states that 'primegen can generate primes up to 1000000000000000'. I am trying to understand why that is so. There is of course a limit at 2^64 ~ 2 * 10^19 (the size of a 64-bit unsigned integer), because this is how the numbers are represented. I know for sure that if there were a huge prime gap (> 2^31) then printing of the numbers would fail, but in this range I think there is no such gap.
Either the author underestimated the bound (and really it is around 10^19), or there is a place in the source code where an arithmetic operation can overflow, or something like that.
The funny thing is that you actually MAY run it for numbers > 10^15:
./primes 10000000000000000 10000000000000100
10000000000000061
10000000000000069
10000000000000079
10000000000000099
and if you believe Wolfram Alpha, it is correct.
Some facts I have "reverse-engineered":
numbers are sieved in batches of 1,920 * PRIMEGEN_WORDS = 3,932,160 numbers (see the primegen_fill function in primegen_next.c),
PRIMEGEN_WORDS controls how big a single sieving batch is - you can adjust it in primegen_impl.h to fit your CPU cache,
the implementation of the sieve itself is in the primegen.c file - I assume it is correct; what you get is a bitmask of primes in pg->buf (see the primegen_fill function),
the bitmask is analyzed and the primes are stored in the pg->p array.
I see no point where the overflow may happen.
I wish I were at my computer to look, but I suspect you would see different results if you started at 1 as your lower bound.
Just from the algorithm, I would conclude that the upper bound comes from 32-bit numbers.
The page mentions a Pentium III as the CPU, so my guess is that the program is quite old and does not use 64-bit arithmetic.
2^32 is approximately 4 * 10^9. The sieve of Atkin (which the algorithm uses) requires about N^(1/2) bits (it uses a big bitfield), which means that with roughly 2^32 bits of memory you can (conservatively) reach an N of about 10^15. As this is a rough, conservative upper bound (the system and other programs occupy memory, address ranges are reserved for I/O, ...), the real upper bound is/might be higher.
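As a rough sanity check on that memory estimate (just illustrative arithmetic, not anything taken from the primegen sources): sqrt(10^15) ≈ 3.2 * 10^7 bits ≈ 4 MB of bitfield, whereas pushing N toward the 2^64 representation limit would need sqrt(2^64) = 2^32 bits ≈ 512 MB, well beyond a typical Pentium III-era machine. So an advertised limit of 10^15 keeps the working set comfortably small, which fits the "conservative bound" reading.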

What are the most hardcore optimisations you've seen?

I'm not talking about algorithmic stuff (eg use quicksort instead of bubblesort), and I'm not talking about simple things like loop unrolling.
I'm talking about the hardcore stuff. Like Tiny Teensy ELF, The Story of Mel; practically everything in the demoscene, and so on.
I once wrote a brute force RC5 key search that processed two keys at a time: the first key used the integer pipeline, the second key used the SSE pipelines, and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.
In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.
(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)
In the early days of DOS, when we used floppy disks for all data transport, there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the boot sector of an inserted floppy disk. When the user inserted the floppy disk into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the hard drive's boot sector, thus permanently infecting the host PC. A particularly annoying virus I was infected by was called "Form". To battle this, I wrote a custom floppy boot sector that had the following features:
Validate the boot sector of the host hard drive and make sure it was not infected.
Validate the floppy boot sector and make sure that it was not infected.
Code to remove the virus from the hard drive if it was infected.
Code to duplicate the antivirus boot sector to another floppy if a special key was pressed.
Code to boot the hard drive if all was well and no infection was found.
This was done in the program space of a boot sector, about 440 bytes :)
The biggest problem for my mates was the very cryptic messages displayed, because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"
I was quite happy with that piece of code. The optimization was for program size, not speed - two quite different kinds of optimization in assembly.
My favorite is the floating-point inverse square root computed via integer operations. This is a cool little hack that exploits how floating-point values are stored; it can execute faster (even doing 1/result to get a plain square root is faster than the stock-standard square root function) or produce more accurate results than the standard methods.
In C/C++ the code is (sourced from Wikipedia):
float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;        // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i>>1);  // now this is what you call a real magic number
    x = *(float*)&i;          // reinterpret back to a float: a rough first guess
    x = x*(1.5f - xhalf*x*x); // one Newton-Raphson step to refine the guess
    return x;
}
A Very Biological Optimisation
Quick background: triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.
Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids - so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses space is at a premium, so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" for DNA-to-protein translation (namely starting from a position divisible by 3, from a position that leaves remainder 1 when divided by 3, or from a position that leaves remainder 2 - and the same three again, but reading the sequence in reverse).
For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)
That's hardcore space optimisation!
UPDATE: Links to further info:
Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
Or you could google for "overlapping reading frames"
Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.
In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.
I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.
The graphics screen is rendered by jumping to an address within the 246-byte code buffer that corresponds to the right edge of the screen and patching a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.
Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:
Background 1 uses a Push Effective Address (PEA) instruction
Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
The critical restriction is that both of the 65816 registers (X and Y) are used to reference data and cannot be modified. Further the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; the stack pointer (S) is mapped to graphics screen, so there is no possibility of jumping to a subroutine and returning.
Given these restrictions, I needed to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with, so the approach was the following:
Define a code fragment to do the blend in the same 64K program bank as the code buffer
Create a copy of this code for each of the 82 words
There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
Done! We have a hard-coded subroutine that does not affect the CPU registers.
Here are the actual code fragments:
code_buff: PEA $0000 ; rightmost word (16-bits = 4 pixels)
PEA $0000 ; background 1
PEA $0000 ; background 1
PEA $0000 ; background 1
LDA (72),y ; background 2
PHA
LDA (70),y ; background 2
PHA
JMP word_68 ; mix the data
word_68_rtn: PEA $0000 ; more background 1
...
PEA $0000
BRA *+40 ; patched exit code
...
word_68: LDA (68),y ; load data for background 2
AND #$00FF ; mask
ORA #$AB00 ; blend with data from background 1
PHA
JMP word_68_rtn ; jump back
word_66: LDA (66),y
...
The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.
Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.
Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.
The Stalin Scheme compiler is pretty crazy in that aspect.
I once saw a switch statement with a lot of empty cases; a comment at the head of the switch said something along the lines of:
Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases
I forget what N was. This was in the source code for Windows that was leaked in 2004.
I've gone through the Intel (or AMD) architecture references to see what instructions there are. movsx - move with sign extension - is awesome for moving small signed values into big spaces in a single instruction, for example.
Likewise, if you know you only use 16-bit values but can access all of EAX, EBX, ECX, EDX, etc., then you have 8 very fast locations for values - just rotate the registers by 16 bits to access the other halves.
The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware could prove a key wasn't the solution, but could not prove a key was the solution); the candidates were then tested with more conventional code.
The FSG 2.0 packer made by a Polish team, made specifically for packing executables written in assembly. If packing assembly isn't impressive enough (it's supposed to be almost as small as possible already), the loader it comes with is 158 bytes and fully functional. If you try packing any assembly-made .exe with something like UPX, it will throw a NotCompressableException at you ;)