Issues with Valgrind when running PETSc

I receive the following errors from valgrind.
==30996== Conditional jump or move depends on uninitialised value(s)
==30996== at 0x12B28904: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==30996== by 0xE12CF9A: ibv_open_device (in /usr/lib64/libibverbs.so.1.0.0)
==30996== by 0xAAFA03B: btl_openib_component_init (in /sw/arcts/centos7/openmpi/1.10.2-gcc-4.8.5/lib/libmpi.so.12.0.2)
==30996== by 0xAAF0832: mca_btl_base_select (in /sw/arcts/centos7/openmpi/1.10.2-gcc-4.8.5/lib/libmpi.so.12.0.2)
==30996== by 0xAAF0160: mca_bml_r2_component_init (in /sw/arcts/centos7/openmpi/1.10.2-gcc-4.8.5/lib/libmpi.so.12.0.2)
==30996== by 0xAAEE95D: mca_bml_base_init (in /sw/arcts/centos7/openmpi/1.10.2-gcc-4.8.5/lib/libmpi.so.12.0.2)
==30996== by 0xABE96D9: mca_pml_ob1_component_init (in /sw/arcts/centos7/openmpi/1.10.2-gcc-4.8.5/lib/libmpi.so.12.0.2)
==30996== by 0xABE75A8: mca_pml_base_select (in /sw/arcts/centos7/openmpi/1.10.2-gcc-4.8.5/lib/libmpi.so.12.0.2)
==30996== by 0xAA98BD3: ompi_mpi_init (in /sw/arcts/centos7/openmpi/1.10.2-gcc-4.8.5/lib/libmpi.so.12.0.2)
==30996== by 0xAAB87EC: PMPI_Init_thread (in /sw/arcts/centos7/openmpi/1.10.2-gcc-4.8.5/lib/libmpi.so.12.0.2)
==30996== by 0x5D4664: PetscInitialize.part.3 (in /scratch/kfid_flux/ykmizu/ROMLSS/bin/ks_main.x)
==30996== by 0x49B5B4: main (in /scratch/kfid_flux/ykmizu/ROMLSS/bin/ks_main.x)
==30996==
and this error repeats itself over and over again. I don't understand why PetscInitialize would give me a hard time; it's one of the first things I call in my main.c file, after initializing some ints, doubles, etc.
PetscInitialize(&argc, &argv, NULL, NULL);
SlepcInitialize(&argc, &argv, NULL, NULL);
PetscViewerPushFormat(PETSC_VIEWER_STDOUT_SELF, PETSC_VIEWER_ASCII_MATLAB);
Are these just false positives? Any help would be greatly appreciated. I'm getting a little desperate about this. Thank you.

There are discussions here.
It seems that you are using Open MPI, which is noisy under valgrind. You can build two versions of PETSc (i.e., two different PETSC_ARCHs): one that uses the optimized MPI on your system, and another built against MPICH using the configure option --download-mpich.
For debugging, select the PETSC_ARCH compiled with MPICH. For performance evaluation, select the PETSC_ARCH compiled with the optimized MPI of your platform.
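For example (a sketch; the PETSC_ARCH names are arbitrary, and --download-mpich, --with-debugging and --with-mpi-dir are standard PETSc configure options):
./configure PETSC_ARCH=arch-mpich-debug --download-mpich --with-debugging=1
./configure PETSC_ARCH=arch-sysmpi-opt --with-mpi-dir=/path/to/your/openmpi --with-debugging=0
You then select one build or the other by setting the PETSC_ARCH environment variable when compiling and running your application.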
Additionally, if you want to use both PETSc and SLEPc, you only need one of PetscInitialize or SlepcInitialize to start their environment; SlepcInitialize calls PetscInitialize internally, so calling both is redundant.
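A minimal sketch of such a main (assuming a standard SLEPc installation; slepcsys.h declares SlepcInitialize, and petscviewer.h declares PetscViewerPushFormat):
#include <slepcsys.h>
#include <petscviewer.h>

int main(int argc, char **argv)
{
  PetscErrorCode ierr;

  /* SlepcInitialize calls PetscInitialize internally,
     so no separate PetscInitialize call is needed */
  ierr = SlepcInitialize(&argc, &argv, NULL, NULL);
  if (ierr) return ierr;

  PetscViewerPushFormat(PETSC_VIEWER_STDOUT_SELF, PETSC_VIEWER_ASCII_MATLAB);

  /* ... application code ... */

  ierr = SlepcFinalize();
  return (int)ierr;
}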
I hope it's helpful for you.

Related

How to generate valgrind suppressions without manual cut and paste?

I want to generate a suppressions file with --gen-suppressions in valgrind.
However, I do not want to have to go through thousands of lines of output to cut and paste out the suppressions, stripping the valgrind stack traces and other valgrind output by hand.
Is there a way to do this easily? This seems like a very basic use case...
// I want this part vvvvv
{
<insert_a_suppression_name_here>
Memcheck:Leak
match-leak-kinds: reachable
fun:malloc
fun:strdup
fun:_XlcCreateLC
fun:_XlcDefaultLoader
fun:_XOpenLC
fun:_XrmInitParseInfo
obj:/usr/lib/x86_64-linux-gnu/libX11.so.6.3.0
fun:XrmGetStringDatabase
obj:/usr/lib/x86_64-linux-gnu/libX11.so.6.3.0
fun:XGetDefault
fun:GetXftDPI
fun:X11_InitModes_XRandR
fun:X11_InitModes
fun:X11_VideoInit
}
// I do not want this part vvvv
==187526== 2 bytes in 1 blocks are still reachable in loss record 2 of 137
==187526== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==187526== by 0x4B7C50E: strdup (strdup.c:42)
==187526== by 0x5922D81: _XlcResolveLocaleName (in /usr/lib/x86_64-linux-gnu/libX11.so.6.3.0)
==187526== by 0x5926387: ??? (in /usr/lib/x86_64-linux-gnu/libX11.so.6.3.0)
==187526== by 0x5925956: ??? (in /usr/lib/x86_64-linux-gnu/libX11.so.6.3.0)
==187526== by 0x592615C: _XlcCreateLC (in /usr/lib/x86_64-linux-gnu/libX11.so.6.3.0)
==187526== by 0x5943664: _XlcDefaultLoader (in /usr/lib/x86_64-linux-gnu/libX11.so.6.3.0)
==187526== by 0x592D995: _XOpenLC (in /usr/lib/x86_64-linux-gnu/libX11.so.6.3.0)
It is quite unlikely that all of the suppressions are different.
If you create a suppression like
{
XINIT-1
Memcheck:Leak
match-leak-kinds: reachable
fun:malloc
fun:strdup
fun:_XlcCreateLC
fun:_XlcDefaultLoader
fun:_XOpenLC
fun:_XrmInitParseInfo
obj:/usr/lib/x86_64-linux-gnu/libX11.so.6.3.0
}
Then re-run. Typically the error count will go down very quickly and you will only need to add a fairly small number of suppressions (single or low double digits).
(You need to apply your knowledge of the code and the libraries involved to pick a sensible stack depth for suppressions: too many stack entries and the suppression becomes too specific, so you need more suppressions; too few and you risk suppressing real problems.)
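As a side note, if you only want to strip the surrounding output mechanically: the suppression blocks are the only lines that start with { and } in column one (the error reports are all prefixed with ==pid==), so a simple filter extracts them (a sketch; ./myprog stands in for your binary):
valgrind --leak-check=full --gen-suppressions=all ./myprog 2>&1 | awk '/^{/,/^}/' > extracted.supp
You still have to rename the <insert_a_suppression_name_here> placeholders and remove duplicates, but it avoids the manual cut and paste.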

NEON VLD consuming more cycles than expected?

I have a simple piece of assembly code that loads 12 NEON quad registers and interleaves pairwise-add instructions with the loads (to exploit the dual-issue capability). I have verified the code here:
http://pulsar.webshaker.net/ccc/sample-d3a7fe78
As one can see, the code should take around 13 cycles. But when I run the code on the board, the load instructions seem to take more than one cycle per load. I verified that VPADAL takes 1 cycle as stated, but VLD1 takes more than one cycle. Why is that?
I have taken care of the following:
The address is 16 byte aligned.
Have provided the alignment hint in the instruction: vld1.64 {d0, d1}, [r0,:128]!
Tried the preload instruction pld [r0, #192] at various places, but that seems to add cycles instead of reducing the latency.
Can someone tell me what I am doing wrong and why this latency occurs?
Other Details:
With reference to Cortex-A8
arm-2009q1 cross-compiler toolchain
coding in assembly
Your code is executing much slower than expected because, as currently written, it causes a perfect storm of pipeline stalls. On any modern pipelined CPU, an instruction can execute in one cycle under ideal conditions: it is not waiting for memory and has no register dependencies. The way you've written the code, you don't allow for the delay in reading from memory, and you make the next instruction dependent on the result of the read, which produces the worst possible performance. Also, I'm not sure why you're accumulating the pairwise adds into multiple registers. Try something like this:
    veor.u16 q12,q12,q12        @ clear the accumulated sum
top_of_loop:
    @ issue a batch of loads first, so the data has arrived
    @ by the time the pairwise adds consume it
    vld1.u16 {q0,q1},[r0,:128]!
    vld1.u16 {q2,q3},[r0,:128]!
    vpadal.u16 q12,q0
    vpadal.u16 q12,q1
    vpadal.u16 q12,q2
    vpadal.u16 q12,q3
    vld1.u16 {q0,q1},[r0,:128]!
    vld1.u16 {q2,q3},[r0,:128]!
    vpadal.u16 q12,q0
    vpadal.u16 q12,q1
    vpadal.u16 q12,q2
    vpadal.u16 q12,q3
    subs r1,r1,#8               @ 8 quad registers consumed per iteration
    bne top_of_loop
Experiment with different numbers of load instructions before executing the adds. The point is that you need to allow time for the read to occur before you can use the target register.
Note: Using Q4-Q7 is risky because they're non-volatile (callee-saved) registers. On Android you will get random garbage appearing in these (especially Q4).

Suppress warnings related to a certain library

How can I tell valgrind to stop showing any kind of error related to a certain library? I got lots of reports that look like this:
==24152== Invalid write of size 8
==24152== at 0xD9FF876: ??? (in /usr/lib64/dri/fglrx_dri.so)
==24152== by 0x110647AF: ???
==24152== Address 0x7f3c98553f20 is not stack'd, malloc'd or (recently) free'd
I could prune them by address (0x7fxxxxxxxxxx is not something that is allocated in userland), but my valgrind build does not seem to accept --ignore-ranges=0x7f0000000000-0x7fffffffffff
You can generate suppression entries using --gen-suppressions=all, then collect them into a .supp file (e.g., under lib/valgrind) and pass it to valgrind with --suppressions=yourfile.supp.
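For the report above, an entry might look like this (a sketch; "Invalid write of size 8" corresponds to the Addr8 suppression kind, so you need one entry per error kind the library triggers):
{
   ignore-fglrx-invalid-write
   Memcheck:Addr8
   obj:/usr/lib64/dri/fglrx_dri.so
}
Since a suppression only has to match the innermost frames of the error's stack, the single obj: line covers every Addr8 error originating in that library.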

Ignore functions in Valgrind memcheck

Valgrinding a program that uses openldap2's libldap is a chore because of OpenSSL's use of uninitialized memory. There exists a --ignore-fn option, but only for the massif subcomponent of Valgrind. Is there anything similar for memcheck to exclude traces in which certain functions appear?
==13795== Use of uninitialised value of size 8
==13795== at 0x6A9C8CF: ??? (in /lib64/libz.so.1.2.3)
==13795== by 0x6A9A63B: inflate (in /lib64/libz.so.1.2.3)
==13795== by 0x68035C1: ??? (in /lib64/libcrypto.so.1.0.0)
==13795== by 0x6802B9F: COMP_expand_block (in /lib64/libcrypto.so.1.0.0)
==13795== by 0x64ABBCD: ssl3_do_uncompress (in /lib64/libssl.so.1.0.0)
==13795== by 0x64ACA6F: ssl3_read_bytes (in /lib64/libssl.so.1.0.0)
==13795== by 0x64A9F2F: ??? (in /lib64/libssl.so.1.0.0)
==13795== by 0x56B3E61: ??? (in /usr/lib64/libldap-2.4.so.2.5.4)
==13795== by 0x5E4DB1B: ??? (in /usr/lib64/liblber-2.4.so.2.5.4)
==13795== by 0x5E4E96E: ber_int_sb_read (in /usr/lib64/liblber-2.4.so.2.5.4)
==13795== by 0x5E4B4A6: ber_get_next (in /usr/lib64/liblber-2.4.so.2.5.4)
==13795== by 0x568FB9E: ??? (in /usr/lib64/libldap-2.4.so.2.5.4)
You can create a suppression file and use it to suppress errors coming from certain sources: http://valgrind.org/docs/manual/manual-core.html#manual-core.suppress
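There is no memcheck equivalent of --ignore-fn, but a suppression entry gives the same effect. A sketch matching the trace above ("Use of uninitialised value of size 8" corresponds to the Value8 suppression kind, and the "..." lines are frame-level wildcards matching any number of intermediate frames):
{
   zlib-uninit-via-libldap
   Memcheck:Value8
   obj:/lib64/libz.so.1.2.3
   ...
   fun:ber_get_next
}
Pass the file to valgrind with --suppressions=yourfile.supp.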

Performance overhead of perform: in Smalltalk (specifically Squeak)

How much slower can I reasonably expect perform: to be than a literal message send, on average? Should I avoid sending perform: in a loop, similar to the admonishment given to Perl/Python programmers to avoid calling eval("...") (Compiler evaluate: in Smalltalk) in a loop?
I'm concerned mainly with Squeak, but interested in other Smalltalks as well. Also, is the overhead greater with the perform:with: variants? Thank you
#perform: is not like eval(). The problem with eval() (performance-wise, anyway) is that it has to compile the code you're sending it at runtime, which is a very slow operation. Smalltalk's #perform:, on the other hand, is equivalent to Ruby's send() or Objective-C's performSelector: (in fact, both of these languages were strongly inspired by Smalltalk). Languages like these already look up methods based on their name — #perform: just lets you specify the name at runtime rather than write-time. It doesn't have to parse any syntax or compile anything like eval().
It will be a little slower (the cost of at least one extra method call), but it isn't like eval(). Also, the variants with more arguments shouldn't show any difference in speed compared with plain perform:. I can't speak from much experience about Squeak specifically, but this is how it generally works.
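For illustration, the following two sends invoke exactly the same method; perform: merely defers the choice of selector to runtime (anObject and foo: are placeholders):
anObject foo: 123.                    "literal send: selector fixed when the code is written"
anObject perform: #foo: with: 123.    "same method lookup, selector chosen at runtime"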
Here are some numbers from my machine (it is Smalltalk/X, but I guess the numbers are comparable - at least the ratios should be):
The called methods "foo" and "foo:" are no-ops (i.e. they consist of just ^self):
self foo ... 3.2 ns
self perform:#foo ... 3.3 ns
[self foo] value ... 12.5 ns (2 sends and 2 contexts)
[ ] value ... 3.1 ns (empty block)
Compiler evaluate:('TestClass foo') ... 1.15 ms
self foo:123 ... 3.3 ns
self perform:#foo: with:123 ... 3.6 ns
[self foo:123] value ... 15 ns (2 sends and 2 contexts)
[self foo:arg] value:123 ... 23 ns (2 sends and 2 contexts)
Compiler evaluate:('TestClass foo:123') ... 1.16 ms
Notice the big difference between "perform:" and "evaluate:"; evaluate: calls the compiler to parse the string, generates a throw-away method (bytecode), executes it (it is jitted on the first call), and finally discards it. The compiler is actually written to be used mainly by the IDE and to fileIn code from external streams; it has code for error reporting, warning messages, etc.
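For illustration, these two expressions produce the same result, but the first runs the whole compile-and-discard machinery on every call while the second is just a dynamic message send (a contrived example; the rough timings refer to the measurements above):
Compiler evaluate:'3 + 4'.        "parse, compile, jit, discard: on the order of 1 ms"
3 perform:#+ with:4.              "ordinary method lookup: a few ns"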
In general, eval is not what you want when performance is critical.
Timings are from a Dell Vostro; your mileage may vary, but the ratios should not.
I tried to get the net execution times by measuring the empty-loop time and subtracting it; also, I ran the tests 10 times and took the best times, to eliminate OS/network/disk/email or whatever other disturbances. However, I did not take special care to run on a load-free machine.
The measuring code was (replace the second timesRepeat: argument with the expressions above):
callFoo2
    |t1 t2|

    t1 := TimeDuration toRun:[
        100000000 timesRepeat:[]
    ].
    t2 := TimeDuration toRun:[
        100000000 timesRepeat:[self foo:123]
    ].
    Transcript showCR:t2-t1
EDIT:
PS: I forgot to mention: these are the times from within the IDE (i.e. bytecode-jitted execution). Statically compiled code (using the stc compiler) will generally be a bit faster (20-30%) on these low-level micro-benchmarks, due to a better register allocation algorithm.
EDIT: I tried to reproduce these numbers the other day, but got completely different results (8ns for the simple call, but 9ns for the perform). So be very careful with these micro-timings, as they run completely out of the first-level cache (and empty messages even omit the context setup, or get inlined) - they are usually not very representative of the overall performance.