Need ActiveMQ MemoryUsage optimization - my queues are overflowing

We have a Java-based message-processing system with nearly 25 different queues and a topic. The broker is capped at 2 GB of memory usage and processes about 40 messages per second on a normal day. The system works fine for a couple of days, then memory starts to spike until it reaches the limit.
In our analysis, we found that MemoryUsage holds the key to the cause; below is the leak-suspect trace from a heap dump, where one queue is using nearly 50% of memory. It is possible that a burst of higher message volume loaded the queue heavily. What is the optimal MemoryUsage configuration for this system?
519,955,448 (62.85%) [72] 8 org/apache/activemq/usage/MemoryUsage 0x80d8d180
519,843,456 (62.84%) [16] 2 java/util/concurrent/CopyOnWriteArrayList 0x80d8d210
519,843,392 (62.84%) [352] 89 array of java/lang/Object 0x822cd2e0
411,721,616 (49.77%) [72] 9 org/apache/activemq/usage/MemoryUsage 0x83833378
411,721,248 (49.77%) [16] 2 java/util/concurrent/CopyOnWriteArrayList 0x83835898
411,721,184 (49.77%) [8] 2 array of java/lang/Object 0x8383a730
411,718,600 (49.77%) [336] 33 org/apache/activemq/broker/region/Queue 0x83833120
411,693,720 (49.77%) [16] 2 org/apache/activemq/store/kahadb/KahaDBTransactionStore$1 0x838353e0
411,693,256 (49.77%) [24] 3 org/apache/activemq/store/kahadb/KahaDBTransactionStore 0x80d76aa0
411,689,856 (49.76%) [280] 37 org/apache/activemq/store/kahadb/KahaDBStore 0x80d74de0
358,088,168 (43.29%) [104] 14 org/apache/kahadb/journal/Journal 0x80d76790
356,119,216 (43.05%) [48] 1 java/util/concurrent/ConcurrentHashMap 0x80d773c0
356,119,168 (43.05%) [64] 16 array of java/util/concurrent/ConcurrentHashMap$Segment 0x80d8e628

It's hard to speculate too much with this limited amount of information. If your consumers fall behind, memory will start to fill, and there is not much you can do about that. At 40 messages per second, it will fill fast, I guess.
What you can do is let the queue overflow to disk once it hits a memory limit. That slows things down, but at least keeps the broker running during a spike.
The area itself is generally complex and, as far as I know, there is no silver bullet.
Read up on message cursors, memory usage, and producer flow control.
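As a starting point, here is a sketch of the relevant knobs in activemq.xml; the limits are illustrative, not tuned for your load. With producer flow control off and a per-destination memoryLimit, messages past the limit are paged to the KahaDB store or temp store instead of piling up on the heap:

<broker xmlns="http://activemq.apache.org/schema/core">
  <destinationPolicy>
    <policyMap>
      <policyEntries>
        <!-- apply to all queues; spool past the memory limit instead of blocking producers -->
        <policyEntry queue=">" producerFlowControl="false" memoryLimit="32mb"/>
      </policyEntries>
    </policyMap>
  </destinationPolicy>
  <systemUsage>
    <systemUsage>
      <memoryUsage>
        <memoryUsage limit="512 mb"/>  <!-- heap available for messages -->
      </memoryUsage>
      <storeUsage>
        <storeUsage limit="8 gb"/>     <!-- KahaDB on-disk store -->
      </storeUsage>
      <tempUsage>
        <tempUsage limit="4 gb"/>      <!-- spool for non-persistent messages -->
      </tempUsage>
    </systemUsage>
  </systemUsage>
</broker>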

Related

CANopen network load higher than expected

I am working on a project with a master computer connected via a CANopen network to 4 slaves.
At each time step, the computer receives a measurement message from each slave, and sends them a control message. In total, 4 messages are received and 4 messages are sent at each time sample.
The message sent is a PDO with 6 data bytes (8 bytes including COB-ID)
The message received is a PDO with 8 data bytes (10 bytes including COB-ID)
My CAN network is configured at 1 Mbit/s, and I run my program at 1000 Hz (1 ms sampling time). As the total load resulting from the messages described is 576 bits/cycle, the expected network load is 576 kbit/s, or 57.6%.
What I see, however, is that:
The controlling computer measures a load of ~86% (with minima of 68% and peaks of 100%).
A USB CAN bus analyser I connect to the network registers around half the message traffic (count-wise) that I nominally expect (i.e., 4 sent and 4 received each cycle for 50 seconds should result in 50k messages, while I only see 18-25k).
I also receive 1-2 error messages per cycle from the slave devices saying that the network is overloaded. Before it is pointed out: even counting the size of these error messages as part of the traffic wouldn't come close to explaining the anomaly in load.
What I'd like to know is whether my way of calculating the CANopen network load is correct. For instance, are there any protocol-specific handshakes, CRCs, or other extra bits sent just to make the network work? I couldn't see anything like that on the CANopen wiki page, but I do know such additions to messages exist in the underlying CAN bus standard.
In a CAN message, there is more than just the data to be transmitted.
There is also the arbitration ID (11 or 29 bits, depending on whether you use CAN 2.0A or 2.0B), a 15-bit CRC, a 7-bit EOF marker, the control field, and some other reserved bits.
Depending on the data, there may also be stuff bits.
Using CAN 2.0B and assuming 48 bits (6 bytes) of data, you get a message size of roughly 132 bits, and roughly 151 bits for your 64-bit (8-byte) messages.
Summing this up, you get roughly 4 × 132 + 4 × 151 = 1132 bits per cycle, which is too much for a 1 Mbit/s bus at 1000 Hz.
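For a quick sanity check, here is a small C sketch using the usual worst-case frame-size formulas (the exact stuff-bit count depends on the payload, so these are upper bounds; note that CANopen PDOs normally use 11-bit identifiers):

#include <stdio.h>

/* Worst-case CAN frame size in bits, including stuff bits.
   An 11-bit-ID (2.0A) frame has 44 + 8n fixed bits, of which 34 + 8n are
   subject to stuffing; a 29-bit-ID (2.0B) frame has 64 + 8n fixed bits,
   of which 54 + 8n are subject to stuffing. The worst case adds one
   stuff bit per 4 stuffable bits. */
static unsigned can_frame_bits(unsigned data_bytes, int extended_id)
{
    unsigned fixed = (extended_id ? 64u : 44u) + 8u * data_bytes;
    unsigned stuffable = (extended_id ? 54u : 34u) + 8u * data_bytes;
    return fixed + (stuffable - 1u) / 4u;
}

int main(void)
{
    /* 4 PDOs with 6 data bytes plus 4 PDOs with 8 data bytes per 1 ms cycle */
    unsigned std_bits = 4u * can_frame_bits(6, 0) + 4u * can_frame_bits(8, 0);
    unsigned ext_bits = 4u * can_frame_bits(6, 1) + 4u * can_frame_bits(8, 1);
    printf("11-bit IDs: %u bits/cycle -> %.1f%% bus load at 1 kHz, 1 Mbit/s\n",
           std_bits, std_bits / 10.0);
    printf("29-bit IDs: %u bits/cycle -> %.1f%% bus load at 1 kHz, 1 Mbit/s\n",
           ext_bits, ext_bits / 10.0);
    return 0;
}

Without worst-case stuffing, the 11-bit total is 4 × 92 + 4 × 108 = 800 bits per cycle (80% load), so the observed ~86% sits between the stuff-free and worst-case figures; the original 576-bit estimate simply ignored the framing overhead.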
Hope that helps.

Number of GCs performed for various generations from a dump file

Is there any way to get information about how many garbage collections have been performed for the different generations from a dump file? When I try to run some PSSCOR4 commands, I get the following.
0:003> !GCUsage
The garbage collector data structures are not in a valid state for traversal.
It is either in the "plan phase," where objects are being moved around, or
we are at the initialization or shutdown of the gc heap. Commands related to
displaying, finding or traversing objects as well as gc heap segments may not
work properly. !dumpheap and !verifyheap may incorrectly complain of heap
consistency errors.
Error: Requesting GC Heap data
0:003> !CLRUsage
The garbage collector data structures are not in a valid state for traversal.
It is either in the "plan phase," where objects are being moved around, or
we are at the initialization or shutdown of the gc heap. Commands related to
displaying, finding or traversing objects as well as gc heap segments may not
work properly. !dumpheap and !verifyheap may incorrectly complain of heap
consistency errors.
Error: Requesting GC Heap data
I can get output from !EEHeap, though it does not give me what I am looking for.
0:003> !EEHeap -gc
Number of GC Heaps: 1
generation 0 starts at 0x0000000002c81030
generation 1 starts at 0x0000000002c81018
generation 2 starts at 0x0000000002c81000
ephemeral segment allocation context: none
segment begin allocated size
0000000002c80000 0000000002c81000 0000000002c87fe8 0x6fe8(28648)
Large object heap starts at 0x0000000012c81000
segment begin allocated size
0000000012c80000 0000000012c81000 0000000012c9e358 0x1d358(119640)
Total Size: Size: 0x24340 (148288) bytes.
------------------------------
GC Heap Size: Size: 0x24340 (148288) bytes.
Dumps
You can see the number of garbage collections in Performance Monitor. However, the way performance counters work makes me believe that this information is not available in a dump file, and probably not even available during live debugging.
Think of Debug.WriteLine(): once the text has been written to the debug output, it is gone. If you didn't have DebugView running at the time, the information is lost. And that's good, otherwise it would look like a memory leak.
Performance counters (as I understand them) work in a similar fashion. Various "pings" are sent out for someone else (the performance monitor) to record. If no one does, the ping, with all its information, is gone.
Live debugging
As already mentioned, you can try performance monitor. If you prefer WinDbg, you can use sxe clrn to see garbage collections happen.
PSSCOR
The commands you mentioned do not show information about the garbage collection count:
0:016> !gcusage
Number of GC Heaps: 1
------------------------------
GC Heap Size 0x36d498(3,593,368)
Total Commit Size 0000000000384000 (3 MB)
Total Reserved Size 0000000017c7c000 (380 MB)
0:016> !clrusage
Number of GC Heaps: 1
------------------------------
GC Heap Size 0x36d498(3,593,368)
Total Commit Size 0000000000384000 (3 MB)
Total Reserved Size 0000000017c7c000 (380 MB)
Note: I'm using PSSCOR2 here, since I have the same .NET 4.5 issue on this machine. But I expect the output of PSSCOR4 to be similar.

Software memory testing for bus failures

I have a board with quite a few flash chips, some of which are showing intermittent failures. Standard memory tests do not point to any specific problem addresses, other than that certain chips fail intermittently under mechanical and thermal stress.
Suspecting the actual connections rather than the flash cells themselves, I'm looking for a way to test the parallel bus for address or data pin errors.
There are memory tests around, but they apply better to RAM than to flash (http://www.ganssle.com/testingram.htm). In particular, parallel flash requires a sequence of bus writes to program each value, so a write/verify failure could just as easily be in the write operation, and that could involve any pin on the bus.
Ideas welcome...
The typical memory tests are there to do exactly that. I prefer a pseudo-randomizer (deterministic, using an LFSR) to the 0xAA, 0x55, 0xFF, 0x00 tests. This allows for an address bus test as well as a data bus test in two passes (repeat inverted). I say typical in the sense of wiggling the data bits and address bits through both states each, and varying the states of signals and their neighbors. As for pounding on a RAM to create thermal or other stresses: you can't write very fast to a flash, so you can't really do fast write/read cycles.
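A minimal sketch of that LFSR idea in C (the 16-bit taps are the textbook maximal-length set; the byte-wide bus and plain stores are assumptions, and for flash you would substitute the device's program sequence): the same seed regenerates the expected sequence on the verify pass, so nothing has to be stored.

#include <stdint.h>

/* 16-bit maximal-length Fibonacci LFSR (taps 16, 14, 13, 11). */
static uint16_t lfsr_next(uint16_t s)
{
    uint16_t bit = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;
    return (uint16_t)((s >> 1) | (bit << 15));
}

/* One pass: fill the region, then verify with the same seed. 'invert'
   gives the second pass with all bits flipped. Returns the first failing
   offset, or -1 if the region verified clean. */
long lfsr_test(volatile uint8_t *base, uint32_t len, uint16_t seed, int invert)
{
    uint16_t s = seed;
    for (uint32_t a = 0; a < len; a++) {
        s = lfsr_next(s);
        base[a] = (uint8_t)(invert ? ~s : s); /* flash: program via command sequence */
    }
    s = seed;
    for (uint32_t a = 0; a < len; a++) {
        s = lfsr_next(s);
        uint8_t want = (uint8_t)(invert ? ~s : s);
        if (base[a] != want)
            return (long)a;
    }
    return -1;
}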
Flash creates another problem: writing and then immediately reading back isn't that interesting. You want to write, then read back later (hours, days, weeks) to determine whether the part is actually holding data.
When you say thermal stress, do you mean the part fails only while it is above X degrees, or that the thermal stress has permanently broken it after the event? Likewise with mechanical stress: does the part fail while vibrating or under stress but recover once the stress is relieved, or has the stress done permanent damage that can be detected whether or not it is under stress?
Now, although you can't do fast write/read cycles, you can punish a flash by reading heavily. I have seen read-disturb problems caused by constant reading of one block or location. It's not necessarily something you have time to do for every location, but you might fill the part with a pseudo-random pattern and concentrate on one location for a while (minutes, tens of minutes); if you have a part that you know is bad, see whether this accelerates detection of the problem and whether any location will show it or only certain ones. Another approach is to read all the locations repetitively for hours/days, or leave the part sitting for hours/days/weeks and then do a read pass without an erase or write and see if it has lost anything.
Unfortunately, as you probably know, each new failure case becomes its own research project and requires the development of a new test.
The first step in testing a memory is a data bus test. Here the data bus wiring is tested to confirm that a value placed on the data bus by the processor is correctly received by the memory device at the other end. An obvious way to test would be to write all possible data values and verify each, but each bit can be tested independently. To perform a walking-1s test, write the first data value in the table, verify by reading it back, write the second value, verify, and so on. When you reach the end of the table, the test is complete:
00000001
00000010
00000100
00001000
00010000
00100000
01000000
10000000
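A hedged C sketch of that walking-1s data bus test (assumes a directly addressable byte-wide device at addr; for flash you would substitute the program command sequence for the plain store):

#include <stdint.h>

/* Walking-1s data bus test at a single address. Returns a bitmask of the
   data lines that misbehaved (0 means the bus passed). */
uint8_t walking_ones(volatile uint8_t *addr)
{
    uint8_t failed = 0;
    for (unsigned bit = 0; bit < 8; bit++) {
        uint8_t pattern = (uint8_t)(1u << bit);
        *addr = pattern;              /* flash: replace with program sequence */
        uint8_t got = *addr;
        if (got != pattern)
            failed |= (uint8_t)(got ^ pattern);
    }
    return failed;
}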
In the linked article Jack Ganssle says: "Critical to this [test], and every other RAM test algorithm, is that you write the pattern to all of RAM before doing the read test."
Since reading can be isolated from writing, testing the flash is easier in one respect: perform the writing portion of the tests while the system is not under stress, then perform the reading portion with the system under stress. By recording the address, expected value, and actual value in enough error cases, you should be able to determine the source of the errors.
If the system never fails when doing the above, you can then perform the whole test while under stress; any errors that appear are most likely write errors.
I've decided to design a memory pattern from which I think I can deduce both data and address errors. The idea is to use values that differ significantly from one another as key indicators of possible read errors, and to detect a failure on one pin at a time.
The test reads alternately from only the bottom and top addresses (0x000000 and 0x3FFFFF; my chip has 22 address lines). In those locations I put 0xFF and 0x00 respectively (the part is byte-wide). The idea is to flip all address and data lines and see what happens. (All other values in the flash differ from 0x00 and 0xFF by at least 3 bits.)
There are 44 addresses that a single-pin failure could erroneously send me to. In each of those I put one of 22 values identifying which of the 22 address pins was flipped. Each value is at least 2 bits different from the others, and 3 bits different from 00 and FF. (I tried for 3 bits different from each other, but 8 bits could only yield 14 such values.)
07,0B,0D,0E,16,1A,1C,1F,25,29,2C,
2F,34,38,3D,3E,43,49,4A,4F,52,58
In the remaining addresses I put a nice pattern of six values, 33, 55, 66, 99, AA, CC (each 3 bits different from all other values): value(address) = nicePattern[popcount(address) % 6].
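A sketch of that fill rule in C (the names and the GCC popcount builtin are mine; the marker table is the 22 values above):

#include <stdint.h>

#define ADDR_BITS 22
#define TOP_ADDR  ((1u << ADDR_BITS) - 1u)  /* 0x3FFFFF */

/* 22 marker values, one per address line; each pair differs in >= 2 bits,
   and each differs from 0x00 and 0xFF in >= 3 bits. */
static const uint8_t addr_marker[ADDR_BITS] = {
    0x07,0x0B,0x0D,0x0E,0x16,0x1A,0x1C,0x1F,0x25,0x29,0x2C,
    0x2F,0x34,0x38,0x3D,0x3E,0x43,0x49,0x4A,0x4F,0x52,0x58
};

static const uint8_t nice[6] = { 0x33,0x55,0x66,0x99,0xAA,0xCC };

uint8_t pattern_at(uint32_t addr)
{
    if (addr == 0x000000u) return 0xFF;
    if (addr == TOP_ADDR)  return 0x00;

    /* Addresses one flipped pin away from 0x000000 have exactly one bit
       set; addresses one pin away from 0x3FFFFF have exactly one bit
       clear. Both get the marker for that pin: 44 addresses in total. */
    for (int pin = 0; pin < ADDR_BITS; pin++) {
        if (addr == (1u << pin))              return addr_marker[pin];
        if (addr == (TOP_ADDR ^ (1u << pin))) return addr_marker[pin];
    }

    /* Everything else: the nice pattern keyed to the address popcount. */
    return nice[__builtin_popcount(addr) % 6];
}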
I tested this and have statistically collected hundreds of intermittent failure incidents, synchronized to the mechanical stress.
Single-bit errors were detectable.
Double-bit errors were deducible (explainable by a combination of frequent single-bit errors).
Errors of 3 or more bits were generally inconclusive.
Even though some of the chips had 3 failing pins, 70% of the incidents were single-bit (the pins usually didn't fail at the same time).
The testing group is now using this to identify which specific connections are failing.

NASM Prefetching

I ran across the below instructions in the NASM documentation, but I can't quite make heads or tails of them. Sadly, the Intel documentation on these instructions is also somewhat lacking.
PREFETCHNTA m8 ; 0F 18 /0 [KATMAI]
PREFETCHT0 m8 ; 0F 18 /1 [KATMAI]
PREFETCHT1 m8 ; 0F 18 /2 [KATMAI]
PREFETCHT2 m8 ; 0F 18 /3 [KATMAI]
Could anyone possibly provide a concise example of the instructions, say to cache 256 bytes at a given address? Thanks in advance!
These instructions are hints used to suggest that the CPU try to prefetch a cache line into the cache. Because they're hints, a CPU can ignore them completely.
If the CPU does support them, then the CPU will try to prefetch but will give up (and won't prefetch) if a TLB miss would be involved. This is where most people get it wrong (e.g. fail to do "preloading", where you insert a dummy read to force a TLB load so that prefetching isn't prevented from working).
The amount of data prefetched is 32 bytes or more, depending on the CPU, etc. You can use CPUID to determine the actual line size (CPUID function 0x00000004; the "System Coherency Line Size", minus one, is returned in EBX bits 0 to 11).
If you prefetch too late it doesn't help, and if you prefetch too early the data can be evicted from the cache before it's used (which also doesn't help). There's an appendix in Intel's "IA-32 Intel Architecture Optimisation Reference Manual" that describes how to calculate when to prefetch, called "Mathematics of Prefetch Scheduling Distance" that you should probably read.
Also don't forget that prefetching can decrease performance (e.g. cause data that is needed to be evicted to make room), and that if you don't prefetch anything, the CPU's hardware prefetcher will probably do it for you anyway. You should also read up on how this hardware prefetcher works (and when it doesn't). For sequential reads (e.g. memcmp()) the hardware prefetcher does the work for you, and explicit prefetches are mostly a waste of time. Explicit prefetching is probably only worth bothering with for "random" (non-sequential) accesses that the CPU's hardware prefetcher can't or won't predict.
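To make the scheduling-distance idea concrete, here is a hedged C sketch using the _mm_prefetch intrinsic (the 8-line distance and 64-byte line size are placeholder assumptions to tune, not recommendations):

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#include <stddef.h>

#define LINE  64         /* cache line size; query CPUID in real code */
#define AHEAD 8          /* prefetch distance in lines; tune per workload */

/* Sum a large buffer, prefetching a few lines ahead of the consumer. */
long long sum_with_prefetch(const int *buf, size_t n)
{
    long long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % (LINE / sizeof(int)) == 0)   /* one hint per cache line */
            _mm_prefetch((const char *)&buf[i] + AHEAD * LINE, _MM_HINT_T0);
        total += buf[i];
    }
    return total;
}

Prefetching a little past the end of the buffer is harmless here: these instructions are hints and do not fault on invalid addresses.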
After sifting through some examples of heavily optimized memcmp functions and the like, I've figured out how to use these instructions (somewhat) effectively.
These instructions operate on a cache "line", 32 bytes on the Katmai-era CPUs they were introduced with (64 bytes on modern CPUs), something I missed originally. Thus, to prefetch a 256-byte buffer into the L2 cache, assuming 32-byte lines, the following sequence could be used:
prefetcht1 [buffer]
prefetcht1 [buffer+32]
prefetcht1 [buffer+64]
prefetcht1 [buffer+96]
prefetcht1 [buffer+128]
prefetcht1 [buffer+160]
prefetcht1 [buffer+192]
prefetcht1 [buffer+224]
The t0 suffix instructs the CPU to prefetch the line into the entire cache hierarchy (all levels, including L1).
t1 prefetches into the L2 cache and above, skipping L1.
t2 prefetches into the L3 cache and above (on many CPUs t1 and t2 behave identically).
The nta suffix is a bit more subtle: it requests a non-temporal prefetch, fetching the data while minimizing cache pollution (for example into L1 only, or marked for early eviction) rather than pulling it through the whole hierarchy. This can be quite useful for incredibly large data structures that are touched only once, as more relevant data can stay cached.

GPU shared memory size is very small - what can I do about it?

The size of shared memory ("local memory" in OpenCL terms) is only 16 KiB on most of today's NVIDIA GPUs.
I have an application in which I need an array of 10,000 integers, so the amount of memory required is 10,000 × 4 bytes = 40 KB.
How can I work around this?
Is there any GPU that has more than 16 KiB of shared memory?
Think of shared memory as explicitly managed cache. You will need to store your array in global memory and cache parts of it in shared memory as needed, either by making multiple passes or some other scheme which minimises the number of loads and stores to/from global memory.
How you implement this will depend on your algorithm - if you can give some details of what it is exactly that you are trying to implement you may get some more concrete suggestions.
One last point: be aware that shared memory is shared between all threads in a block, so you have far less than 16 KiB per thread, unless you have a single data structure that is common to all threads in the block.
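As an illustration of that staging pattern, here is a hedged CUDA sketch (the kernel, sizes, and the doubling "work" are made up for the example): each block copies one tile of the global array into shared memory, synchronizes, works on the tile, and moves on.

#define TILE 256   /* 256 ints = 1 KB of shared memory per block */

__global__ void process(const int *in, int *out, int n)
{
    __shared__ int tile[TILE];

    /* Grid-stride loop: each block handles one tile at a time. */
    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        int i = base + threadIdx.x;

        if (i < n) tile[threadIdx.x] = in[i];       /* stage into shared memory */
        __syncthreads();

        if (i < n) out[i] = tile[threadIdx.x] * 2;  /* placeholder work on the tile */
        __syncthreads();
    }
}

Launching with, say, process<<<16, TILE>>>(d_in, d_out, 10000) keeps the per-block shared footprint at 1 KB while still covering the whole 10,000-element array.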
All compute capability 2.0 and greater devices (most GPUs from the last year or two) have 48 KB of shared memory available per multiprocessor. That being said, Paul's answer is correct in that you likely will not want to load all 10K integers into a single multiprocessor.
You can try to use cudaFuncSetCacheConfig(nameOfKernel, cudaFuncCachePrefer{Shared, L1}) function.
If you prefer L1 to Shared, then 48KB will go to L1 and 16KB will go to Shared.
If you prefer Shared to L1, then 48KB will go to Shared and 16KB will go to L1.
Usage:
cudaFuncSetCacheConfig(matrix_multiplication, cudaFuncCachePreferShared);
matrix_multiplication<<<bla, bla>>>(bla, bla, bla);