I run valgrind-3.10.0 to search for memory leaks in my Fortran program. I'm using gfortran-4.9.0 to compile on OS X 10.9.5. From what I can tell from the output below, the memory leak is in a gfortran library. Am I correct? If so, is there anything I can do about it?
HEAP SUMMARY:
==30650== in use at exit: 25,727 bytes in 390 blocks
==30650== total heap usage: 34,130 allocs, 33,740 frees, 11,306,357 bytes allocated
==30650==
==30650== Searching for pointers to 390 not-freed blocks
==30650== Checked 9,113,592 bytes
==30650==
==30650== 72 (36 direct, 36 indirect) bytes in 1 blocks are definitely lost in loss record 52 of 84
==30650== at 0x47E1: malloc (vg_replace_malloc.c:300)
==30650== by 0x345AB0: __Balloc_D2A (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x345CF6: __i2b_D2A (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x34362E: __dtoa (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x36A8A9: __vfprintf (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x3912DA: __v2printf (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x376F66: _vsnprintf (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x376FC5: vsnprintf_l (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x3674DC: snprintf (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0xE2F6D: write_float (in /usr/local/gfortran/lib/libgfortran.3.dylib)
==30650== by 0xE53A4: _gfortrani_write_real (in /usr/local/gfortran/lib/libgfortran.3.dylib)
==30650== by 0x3FA9999999999999: ???
==30650==
==30650== LEAK SUMMARY:
==30650== definitely lost: 36 bytes in 1 blocks
==30650== indirectly lost: 36 bytes in 1 blocks
==30650== possibly lost: 0 bytes in 0 blocks
==30650== still reachable: 316 bytes in 7 blocks
==30650== suppressed: 25,339 bytes in 381 blocks
==30650== Reachable blocks (those to which a pointer was found) are not shown.
==30650== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==30650==
==30650== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 15 from 15)
--30650--
--30650-- used_suppression: 34 OSX109:6-Leak /usr/local/lib/valgrind/default.supp:797 suppressed: 13,656 bytes in 252 blocks
--30650-- used_suppression: 1 OSX109:1-Leak /usr/local/lib/valgrind/default.supp:747 suppressed: 2,064 bytes in 1 blocks
--30650-- used_suppression: 13 OSX109:7-Leak /usr/local/lib/valgrind/default.supp:808 suppressed: 7,181 bytes in 78 blocks
--30650-- used_suppression: 11 OSX109:10-Leak /usr/local/lib/valgrind/default.supp:839 suppressed: 1,669 bytes in 29 blocks
--30650-- used_suppression: 10 OSX109:9-Leak /usr/local/lib/valgrind/default.supp:829 suppressed: 609 bytes in 15 blocks
--30650-- used_suppression: 5 OSX109:5-Leak /usr/local/lib/valgrind/default.supp:787 suppressed: 144 bytes in 5 blocks
--30650-- used_suppression: 1 OSX109:3-Leak /usr/local/lib/valgrind/default.supp:765 suppressed: 16 bytes in 1 blocks
==30650==
==30650== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 15 from 15)
This could very well be a bug in the gfortran runtime library.
Your best bet would be to reduce this to a self-contained test case and report it to the gfortran developers at fortran@gcc.gnu.org, or to file a bug report at http://gcc.gnu.org/bugzilla .
I have a question about memory allocation in VxWorks.
It looks like VxWorks allocates a few bytes more than requested.
Scenario 1:
I request 64 bytes; VxWorks allocates 66 bytes, a difference of 2 bytes.
Scenario 2:
I request 88 bytes; VxWorks allocates 96 bytes, a difference of 8 bytes.
The spec for my CPU says it should get 5.336GB/s bandwidth to memory. To test this, I wrote a simple program that runs memset (or memcpy) on a big array and reports the timing. I'm seeing 3.8GB/s on memset and 1.9GB/s on memcpy. http://en.wikipedia.org/wiki/Intel_Core_(microarchitecture) says my Q9400 should be getting 5.336GB/s. What's wrong?
I've tried replacing memset or memcpy with assignment loops. I've googled around to try to learn about memory alignment. I've tried different compiler flags. I've spent an embarrassing number of hours on this. Thanks for any help you can provide!
I'm using Ubuntu 12.04 with libc-dev version 2.15-0ubuntu10.5 and kernel 3.8.0-37-generic
The code:
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>
#define numBytes ((long)(1024*1024*1024))
#define numTransfers ((long)(8))
int main(int argc, char **argv) {
    if (argc != 3) {
        printf("Usage: %s BLOCK_SIZE_IN_BYTES NUMBER_OF_BLOCKS_TO_TRANSFER\n", argv[0]);
        return -1;
    }
    char *__restrict__ source = (char *)malloc(numBytes);
    char *__restrict__ dest = (char *)malloc(numBytes);
    struct timespec start, end;
    long totalTimeMs;
    int i;
    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    for (i = 0; i < numTransfers; ++i)
        memset(source, 0, numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);
    totalTimeMs = (end.tv_nsec - start.tv_nsec) * .000001 + 1000 * (end.tv_sec - start.tv_sec);
    printf("memset %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s). ", numBytes, numTransfers, numBytes/1024.0/1024/1024*numTransfers, totalTimeMs, numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);
    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    for (i = 0; i < numTransfers; ++i)
        memcpy(dest, source, numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);
    totalTimeMs = (end.tv_nsec - start.tv_nsec) * .000001 + 1000 * (end.tv_sec - start.tv_sec);
    printf("memcpy %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s).\n", numBytes, numTransfers, numBytes/1024.0/1024/1024*numTransfers, totalTimeMs, numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);
    free(source);
    free(dest);
    return EXIT_SUCCESS;
}
Compile commands:
gcc -O3 -DNDEBUG -o memcpyStackOverflowNoParameters.c.o -c memcpyStackOverflowNoParameters.c
gcc -O3 -DNDEBUG memcpyStackOverflowNoParameters.c.o -o memcpy -rdynamic -lrt
Sample outputs:
memset 1073741824 bytes 8 times (8.00GB total) in 2214ms (3.880GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4466ms (1.923GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4557ms (1.885GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2222ms (3.866GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4433ms (1.938GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2216ms (3.876GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4521ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2217ms (3.875GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4520ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4430ms (1.939GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2226ms (3.859GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4444ms (1.933GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2225ms (3.861GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4485ms (1.915GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2620ms (3.279GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4855ms (1.769GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2535ms (3.389GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4870ms (1.764GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2423ms (3.545GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4905ms (1.751GB/s).
My hardware according to lshw:
product: OptiPlex 960 ()
vendor: Winbond Electronics
width: 64 bits
*-core
description: Motherboard
product: 0Y958C
vendor: Winbond Electronics
*-firmware
description: BIOS
capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
*-cpu
product: Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
physical id: 400
size: 2666MHz
width: 64 bits
clock: 1333MHz
capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
configuration: cores=4 enabledcores=4 threads=4
*-cache:0
description: L1 cache
physical id: 700
size: 256KiB
capacity: 256KiB
capabilities: internal write-back unified
*-cache:1
description: L2 cache
physical id: 701
size: 6MiB
capacity: 6MiB
capabilities: internal varies unified
*-memory
description: System Memory
physical id: 1000
slot: System board or motherboard
size: 4GiB
*-bank:0
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns)
product: CT51264AA667.M16FC
vendor: 7F7F7F7F7F9B0000
slot: DIMM_1
size: 4GiB
width: 64 bits
clock: 667MHz (1.5ns)
*-bank:1
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
*-bank:2
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
*-bank:3
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
Memory addresses are "virtualized": the addresses your program uses are translated to real addresses. This translation makes it possible to allocate what your program sees as contiguous memory from whatever pieces are handy at the time. Every general-purpose CPU does this.

The translation requires a table lookup, which itself requires memory access. The CPU has a cache for it, but long stretches of virtual addresses can easily blow that cache, the "TLB" ("translation lookaside buffer"). So every 4KB (2MB on a Linux system that has figured out what you're doing) the CPU stalls, hunting up where to really send your memory traffic, and those stalls can take quite a bit of time.

You might try running two copies of your benchmark; it seems reasonable that the TLB misses won't coincide and you'll get aggregate bandwidth much closer to your rated capacity.
(edit: um, you might want to replace your #defines with
size_t numBytes=atoi(argv[1]);
size_t numTransfers=atoi(argv[2]);
in the main body ...)
Edit: by the way: the bandwidth I saw (and reported in comments) from this test on my box was so far below rated capacity for my cpu that it got me investigating my own system. My box builder had put really crap memory in those slots. I've long since replaced them with a known-good brand, more than doubled the reported throughput, and very visibly improved the performance of my machine.
Last I checked, memcpy and memset were not well optimized in GCC, and this was still true in 2012. See Agner Fog's Optimizing Software in C++, section 2.6 "Choice of function libraries" and Table 2.1, where he compares several different compilers and OSes.
GCC has built-in functions for memcpy. Apparently they are even worse than the memcpy in glibc. As far as I understand, the GCC developers and the glibc developers work independently. To get the library versions instead, you need to compile with -fno-builtin. However, although glibc is (or at least was) better, it's still not optimal. For the best results use Agner Fog's asmlib: he has optimized memcpy, memset, and many other common functions in assembly to take advantage of SSE and AVX, among other optimizations.
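To compare glibc's memcpy/memset against GCC's builtins on the benchmark above, the builtins can be disabled per function at compile time (a sketch based on the questioner's compile line; per-function -fno-builtin-* flags are documented in the GCC manual):

```shell
# Disable GCC's inline expansion of memcpy/memset so the glibc versions are called
gcc -O3 -DNDEBUG -fno-builtin-memcpy -fno-builtin-memset \
    memcpyStackOverflowNoParameters.c -o memcpy -lrt
```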
I am using valgrind on a program which runs an infinite loop.
Memcheck displays memory leaks after the program ends, but since my program has an infinite loop it will never end.
So is there any way I can force valgrind to dump the leak data from time to time?
Thanks
Have a look at the client requests feature of memcheck. You can probably use VALGRIND_DO_LEAK_CHECK or similar.
EDIT:
In response to the comment above that this doesn't work, here is an example program which loops forever:
#include <valgrind/memcheck.h>
#include <unistd.h>
#include <cstdlib>
int main(int argc, char* argv[])
{
    while (true) {
        char* leaked = new char[1];  // deliberately leak one byte per iteration
        VALGRIND_DO_LEAK_CHECK;      // ask memcheck to run a leak search right now
        sleep(1);
    }
    return EXIT_SUCCESS;
}
When I run this in valgrind, I get an endless output of new leaks:
$ valgrind ./a.out
==16082== Memcheck, a memory error detector
==16082== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==16082== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==16082== Command: ./a.out
==16082==
==16082== LEAK SUMMARY:
==16082== definitely lost: 0 bytes in 0 blocks
==16082== indirectly lost: 0 bytes in 0 blocks
==16082== possibly lost: 0 bytes in 0 blocks
==16082== still reachable: 1 bytes in 1 blocks
==16082== suppressed: 0 bytes in 0 blocks
==16082== Reachable blocks (those to which a pointer was found) are not shown.
==16082== To see them, rerun with: --leak-check=full --show-reachable=yes
==16082==
==16082== 1 bytes in 1 blocks are definitely lost in loss record 2 of 2
==16082== at 0x4C2BF77: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==16082== by 0x4007EE: main (testme.cc:9)
==16082==
==16082== LEAK SUMMARY:
==16082== definitely lost: 1 bytes in 1 blocks
==16082== indirectly lost: 0 bytes in 0 blocks
==16082== possibly lost: 0 bytes in 0 blocks
==16082== still reachable: 1 bytes in 1 blocks
==16082== suppressed: 0 bytes in 0 blocks
==16082== Reachable blocks (those to which a pointer was found) are not shown.
==16082== To see them, rerun with: --leak-check=full --show-reachable=yes
==16082==
==16082== 2 bytes in 2 blocks are definitely lost in loss record 2 of 2
==16082== at 0x4C2BF77: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==16082== by 0x4007EE: main (testme.cc:9)
==16082==
==16082== LEAK SUMMARY:
==16082== definitely lost: 2 bytes in 2 blocks
==16082== indirectly lost: 0 bytes in 0 blocks
==16082== possibly lost: 0 bytes in 0 blocks
==16082== still reachable: 1 bytes in 1 blocks
==16082== suppressed: 0 bytes in 0 blocks
==16082== Reachable blocks (those to which a pointer was found) are not shown.
==16082== To see them, rerun with: --leak-check=full --show-reachable=yes
The program does not terminate.
With valgrind 3.7.0, you can trigger (among other things) a leak search from the shell, using vgdb.
See e.g. http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.monitor-commands
(you can issue these monitor commands from gdb or from a shell command line, using vgdb).
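For example (a sketch assuming valgrind 3.7.0 or later; `myProg` stands for your own binary): start the program under memcheck with the embedded gdbserver enabled, then trigger a leak search from a second terminal while it is still running:

```shell
# Terminal 1: run under memcheck with the gdbserver active (the default in 3.7.0+)
valgrind --leak-check=full --vgdb=yes ./myProg

# Terminal 2: ask the running memcheck for a full leak search over all leak kinds
vgdb leak_check full reachable any
```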
Use of VALGRIND_DO_LEAK_CHECK (acm's answer) works for me.
Remarks:
- The program has to be launched under valgrind (valgrind myProg ...)
- The valgrind-devel package has to be installed (to have valgrind/memcheck.h)
I'm using the command:
valgrind --tool=memcheck --leak-check=yes ./prog
When this runs with a test script, I get no inline error messages or warnings; I just get a heap summary and a leak summary.
Am I missing a flag or something?
==31420== HEAP SUMMARY:
==31420== in use at exit: 1,580 bytes in 10 blocks
==31420== total heap usage: 47 allocs, 37 frees, 7,132 bytes allocated
==31420==
==31420== 1,580 (1,440 direct, 140 indirect) bytes in 5 blocks are definitely lost in loss record 2 of 2
==31420== at 0x4C274A8: malloc (vg_replace_malloc.c:236)
==31420== by 0x400FD4: main (lab1.c:51)
==31420==
==31420== LEAK SUMMARY:
==31420== definitely lost: 1,440 bytes in 5 blocks
==31420== indirectly lost: 140 bytes in 5 blocks
==31420== possibly lost: 0 bytes in 0 blocks
==31420== still reachable: 0 bytes in 0 blocks
==31420== suppressed: 0 bytes in 0 blocks
==31420==
==31420== For counts of detected and suppressed errors, rerun with: -v
==31420== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
The last time I used valgrind (a few days ago) it would print out error messages as they occurred, in addition to the heap and leak summaries.
EDIT:
I tried --leak-check=full, same result.
The line mentioned in the heap summary (lab1.c:51) is:
temp_record = malloc(sizeof(struct server_record));
And I use this pointer pretty often in my code. That is what was so helpful about the valgrind error messages before: they would show me when I lost my pointer to this malloc, among other problems.
I am trying to profile a simple C program using valgrind:
[zsun#nel6005001 ~]$ valgrind --tool=memcheck ./fl.out
==2238== Memcheck, a memory error detector
==2238== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==2238== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==2238== Command: ./fl.out
==2238==
==2238==
==2238== HEAP SUMMARY:
==2238== in use at exit: 1,168 bytes in 1 blocks
==2238== total heap usage: 1 allocs, 0 frees, 1,168 bytes allocated
==2238==
==2238== LEAK SUMMARY:
==2238== definitely lost: 0 bytes in 0 blocks
==2238== indirectly lost: 0 bytes in 0 blocks
==2238== possibly lost: 0 bytes in 0 blocks
==2238== still reachable: 1,168 bytes in 1 blocks
==2238== suppressed: 0 bytes in 0 blocks
==2238== Rerun with --leak-check=full to see details of leaked memory
==2238==
==2238== For counts of detected and suppressed errors, rerun with: -v
==2238== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 12 from 8)
Profiling timer expired
The c code I am trying to profile is the following:
void forloop(void) {
    int fac = 1;
    int count = 5;
    int i, k;

    for (i = 1; i <= count; i++) {
        for (k = 1; k <= count; k++) {
            fac = fac * i;
        }
    }
}
"Profiling timer expired" shows up. What does it mean, and how can I solve this problem? Thanks!
The problem is that you are using valgrind on a program compiled with -pg. You cannot use valgrind and gprof together. The valgrind manual suggests using OProfile if you are on Linux and need to profile the actual emulation of the program under valgrind.
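So one option is to rebuild without -pg and, if you want profiling data from the run under valgrind itself, use valgrind's own callgrind tool instead of gprof (a sketch; fl.c stands for your source file, and <pid> for the id in the generated output file's name):

```shell
# Rebuild without -pg so the gprof profiling timer never fires under valgrind
gcc -g -O2 fl.c -o fl.out

# Profile under valgrind's own profiler, then annotate the result per line
valgrind --tool=callgrind ./fl.out
callgrind_annotate callgrind.out.<pid>
```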
By the way, this isn't computing factorial.
If you're really trying to find out where the time goes, you could try stackshots. I put an infinite loop around your code and took 10 of them. Here's the code:
6: void forloop(void){
7: int fac=1;
8: int count=5;
9: int i,k;
10:
11: for (i = 1; i <= count; i++){
12: for(k=1;k<=count;k++){
13: fac = fac * i;
14: }
15: }
16: }
17:
18: int main(int argc, char* argv[])
19: {
20: int i;
21: for (;;){
22: forloop();
23: }
24: return 0;
25: }
And here are the stackshots, re-ordered with the most frequent at the top:
forloop() line 12
main() line 23
forloop() line 12 + 21 bytes
main() line 23
forloop() line 12 + 21 bytes
main() line 23
forloop() line 12 + 9 bytes
main() line 23
forloop() line 13 + 7 bytes
main() line 23
forloop() line 13 + 3 bytes
main() line 23
forloop() line 6 + 22 bytes
main() line 23
forloop() line 14
main() line 23
forloop() line 7
main() line 23
forloop() line 11 + 9 bytes
main() line 23
What does this tell you? It says that line 12 consumes about 40% of the time, and line 13 consumes about 20% of the time. It also tells you that line 23 consumes nearly 100% of the time.
That means unrolling the loop at line 12 might potentially give you a speedup factor of 100/(100-40) = 100/60 = 1.67x approximately. Of course there are other ways to speed up this code as well, such as by eliminating the inner loop, if you're really trying to compute factorial.
I'm just pointing this out because it's a bone-simple way to do profiling.
You are not going to be able to compute 10000! like that. You will need some sort of bignum implementation for computing factorials. This is because int is "usually" 4 bytes long, which means it can "usually" hold at most 2^31 - 1 as a signed int (2^32 - 1 unsigned), and 13! is already more than that. Even if you used an unsigned long ("usually" 8 bytes) you'd overflow by the time you reached 21!.
As for what "Profiling timer expired" means: it means valgrind received the signal SIGPROF (http://en.wikipedia.org/wiki/SIGPROF), which comes from the profiling timer set up when the program is compiled with -pg.