Limit memcheck to your own code - valgrind

Let's say I am using a library that uses glibc. When I exit the program while running it through Valgrind, all sorts of memory leaks are detected. I am 100% sure that none of the leaks are related to the few lines of code I just wrote. Is there a way to suppress leaks from other libraries and limit the leak detection to my immediate code?
For example:
valgrind --tool=memcheck --leak-check=full --leak-resolution=high \
--log-file=vgdump ./Main
Where the executable was built from the following source:
// Include header files for application components.
#include <QtGui>

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);

    QWidget window;
    window.resize( 320, 240 );
    window.setWindowTitle(
        QApplication::translate( "toplevel", "Top-level Widget" ) );
    window.show( );

    QPushButton button(
        QApplication::translate( "childwidget", "Press me" ), &window );
    button.move( 100, 100 );
    button.show( );

    int status = app.exec();
    return status;
}
Has a log-file that reports the following (large portions removed):
==12803== Memcheck, a memory error detector
==12803== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==12803== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==12803== Command: ./Main
==12803== Parent PID: 12700
==12803==
==12803==
==12803== HEAP SUMMARY:
==12803== in use at exit: 937,411 bytes in 8,741 blocks
==12803== total heap usage: 38,227 allocs, 29,486 frees, 5,237,254 bytes allocated
==12803==
==12803== 1 bytes in 1 blocks are possibly lost in loss record 1 of 4,557
==12803== at 0x402577E: malloc (vg_replace_malloc.c:195)
==12803== by 0xA1DFA4: g_malloc (in /lib/libglib-2.0.so.0.2600.0)
==12803== by 0xA37F29: g_strdup (in /lib/libglib-2.0.so.0.2600.0)
==12803== by 0xB2A6FA: g_param_spec_string (in /lib/libgobject-2.0.so.0.2600.0)
==12803== by 0x41F36473: ??? (in /usr/lib/libgtk-x11-2.0.so.0.2200.0)
==12803== by 0xB3D237: g_type_class_ref (in /lib/libgobject-2.0.so.0.2600.0)
==12803== by 0xB20B38: g_object_newv (in /lib/libgobject-2.0.so.0.2600.0)
==12803== by 0xB212EF: g_object_new (in /lib/libgobject-2.0.so.0.2600.0)
==12803== by 0x41F34857: gtk_settings_get_for_screen (in /usr/lib/libgtk-x11-2.0.so.0.2200.0)
==12803== by 0x41ED0CB6: ??? (in /usr/lib/libgtk-x11-2.0.so.0.2200.0)
==12803== by 0xB377C7: g_cclosure_marshal_VOID__OBJECT (in /lib/libgobject-2.0.so.0.2600.0)
==12803== by 0xB1ABE2: g_closure_invoke (in /lib/libgobject-2.0.so.0.2600.0)
...
==12803== 53,244 bytes in 29 blocks are possibly lost in loss record 4,557 of 4,557
==12803== at 0x402577E: malloc (vg_replace_malloc.c:195)
==12803== by 0xA1DFA4: g_malloc (in /lib/libglib-2.0.so.0.2600.0)
==12803== by 0xA36050: g_slice_alloc (in /lib/libglib-2.0.so.0.2600.0)
==12803== by 0xA36315: g_slice_alloc0 (in /lib/libglib-2.0.so.0.2600.0)
==12803== by 0xB40077: g_type_create_instance (in /lib/libgobject-2.0.so.0.2600.0)
==12803== by 0xB1CE35: ??? (in /lib/libgobject-2.0.so.0.2600.0)
==12803== by 0xB205C6: g_object_newv (in /lib/libgobject-2.0.so.0.2600.0)
==12803== by 0xB212EF: g_object_new (in /lib/libgobject-2.0.so.0.2600.0)
==12803== by 0x6180FA3: ??? (in /usr/lib/gtk-2.0/2.10.0/engines/libclearlooks.so)
==12803== by 0x41F0CDDD: ??? (in /usr/lib/libgtk-x11-2.0.so.0.2200.0)
==12803== by 0x41F11C24: gtk_rc_get_style (in /usr/lib/libgtk-x11-2.0.so.0.2200.0)
==12803== by 0x4200A81F: ??? (in /usr/lib/libgtk-x11-2.0.so.0.2200.0)
==12803==
==12803== LEAK SUMMARY:
==12803== definitely lost: 2,296 bytes in 8 blocks
==12803== indirectly lost: 7,720 bytes in 382 blocks
==12803== possibly lost: 509,894 bytes in 2,908 blocks
==12803== still reachable: 417,501 bytes in 5,443 blocks
==12803== suppressed: 0 bytes in 0 blocks
==12803== Reachable blocks (those to which a pointer was found) are not shown.
==12803== To see them, rerun with: --leak-check=full --show-reachable=yes
==12803==
==12803== For counts of detected and suppressed errors, rerun with: -v
==12803== ERROR SUMMARY: 1364 errors from 1364 contexts (suppressed: 122 from 11)

To ignore Leak errors in all shared libraries under any lib directory (/lib, /lib64, /usr/lib, /usr/lib64, ...), put this in a file and pass it to valgrind with --suppressions=FILENAME:
{
   ignore_unversioned_libs
   Memcheck:Leak
   ...
   obj:*/lib*/lib*.so
}
{
   ignore_versioned_libs
   Memcheck:Leak
   ...
   obj:*/lib*/lib*.so.*
}
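For example, if the two entries above are saved in a file called libs.supp (the name is arbitrary), the command from the question becomes:
valgrind --tool=memcheck --leak-check=full --leak-resolution=high \
--suppressions=libs.supp --log-file=vgdump ./Main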
This will probably suffice to limit memcheck reporting to your own code only. However, beware that this will ignore errors caused by any callbacks you wrote that were invoked by the libraries. Catching errors in those callbacks could almost be done with:
{
   ignore_unversioned_libs
   Memcheck:Leak
   obj:*/lib*/lib*.so
   ...
   obj:*/lib*/lib*.so
}
{
   ignore_versioned_libs
   Memcheck:Leak
   obj:*/lib*/lib*.so.*
   ...
   obj:*/lib*/lib*.so.*
}
... but this still reveals errors in library calls that go through the Valgrind malloc. Since the Valgrind malloc is injected directly into the program text -- not loaded as a dynamic library -- it appears in the stack the same way your own code does. That is what lets Valgrind track the allocations, but it also makes it harder to do exactly what you have asked.
FYI: I am using valgrind 3.5.
The above is excerpted from an answer to an older, slightly different question; the same question is asked in the body text of this question (so the title is a little insufficient):
Is it possible to make valgrind ignore certain libraries?

Look up the topic of suppressions on the Valgrind web site; you want to suppress errors from the third-party library.
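If writing suppression patterns by hand gets tedious, memcheck can also print a ready-to-paste suppression entry for every error it reports: run with --gen-suppressions=all and copy the entries for the third-party library into your suppressions file. For example:
valgrind --tool=memcheck --leak-check=full --gen-suppressions=all ./Main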

Related

failed to alloc X bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory

I am trying to run a TensorFlow project and I am encountering memory problems on the university HPC cluster. I have to run a prediction job for hundreds of inputs, with differing lengths. We have GPU nodes with different amounts of vmem, so I am trying to set up the scripts in a way that will not crash on any combination of GPU node and input length.
After searching the net for solutions, I played around with TF_FORCE_UNIFIED_MEMORY, XLA_PYTHON_CLIENT_MEM_FRACTION, XLA_PYTHON_CLIENT_PREALLOCATE, and TF_FORCE_GPU_ALLOW_GROWTH, and also with TensorFlow's set_memory_growth. As I understood it, with unified memory I should be able to use more memory than the GPU physically has.
This was my final solution (only relevant parts):
os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2.0'
#os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'  # as I understood, this is redundant with the set_memory_growth part :)

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            print(gpu)
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
and I submit it on the cluster with --mem=30G (slurm job scheduler) and --gres=gpu:1.
And this is the error my code crashes with. As I understand it, it does try to use the unified memory but fails for some reason:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5582 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:02:00.0, compute capability: 3.5)
2021-08-24 09:22:02.053935: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 12758286336 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:03.738635: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 11482457088 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:05.418059: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 10334211072 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:07.102411: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 9300789248 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:08.784349: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 8370710016 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:10.468644: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 7533638656 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:12.150588: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 6780274688 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:23:10.326528: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.33GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Traceback (most recent call last):
File "scripts/script.py", line 654, in <module>
prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "env/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "env/lib/python3.7/site-packages/jax/_src/api.py", line 402, in cache_miss
donated_invars=donated_invars, inline=inline)
File "env/lib/python3.7/site-packages/jax/core.py", line 1561, in bind
return call_bind(self, fun, *args, **params)
File "env/lib/python3.7/site-packages/jax/core.py", line 1552, in call_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "env/lib/python3.7/site-packages/jax/core.py", line 1564, in process
return trace.process_call(self, fun, tracers, params)
File "env/lib/python3.7/site-packages/jax/core.py", line 607, in process_call
return primitive.impl(f, *tracers, **params)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 608, in _xla_call_impl
*unsafe_map(arg_spec, args))
File "env/lib/python3.7/site-packages/jax/linear_util.py", line 262, in memoized_fun
ans = call(fun, *args)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 758, in _xla_callable
compiled = compile_or_get_cached(backend, built, options)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 76, in compile_or_get_cached
return backend_compile(backend, computation, compile_options)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "scripts/script.py", line 654, in <module>
prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
return backend.compile(built_c, compile_options=options)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.
I would be glad for any ideas on how to get it to work and use all the available memory.
Thank you!
Looks like your GPU doesn't fully support unified memory. The support is limited: in effect, the GPU must hold all the data in its own memory.
See this article for the description: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
In particular:
On systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows that the pages are resident on that GPU.
And:
Since these older GPUs can’t page fault, all data must be resident on the GPU just in case the kernel accesses it (even if it won’t).
And your GPU is Kepler-based, according to TechPowerUp: https://www.techpowerup.com/gpu-specs/geforce-gtx-titan-black.c2549
As far as I know, TensorFlow should also issue a warning about that. Something like:
Unified memory on GPUs with compute capability lower than 6.0 (pre-Pascal class GPUs) does not support oversubscription.
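If you want to verify this from Python, recent TensorFlow versions can report each GPU's compute capability; a minimal sketch (get_device_details requires a reasonably new TF 2.x, as far as I know):
import tensorflow as tf

# Print the name and compute capability of every visible GPU.
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    print(details.get('device_name'), details.get('compute_capability'))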
Probably this answer will be useful for you. The nvidia_smi Python module has some useful tools, such as checking a GPU's total memory. Here I reproduce the code from the answer I mentioned earlier.
import nvidia_smi
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print("Total memory:", info.total)
nvidia_smi.nvmlShutdown()
I think this should be your starting point. A simple solution would be to set the batch size according to the GPU memory. If you only want to get predictions, there is usually nothing besides the batch size that is especially memory-intensive. Also, if any preprocessing is done on the GPU, I would recommend moving it to the CPU.
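Building on the snippet above, here is a minimal sketch of sizing the batch from the reported total memory (the per-sample footprint below is a made-up placeholder; you would need to measure it for your own model):
import nvidia_smi

# Query the total memory of the first GPU via NVML.
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
total_bytes = nvidia_smi.nvmlDeviceGetMemoryInfo(handle).total
nvidia_smi.nvmlShutdown()

# Hypothetical footprint of one input sample; measure this for your model.
BYTES_PER_SAMPLE = 256 * 1024 * 1024

# Keep ~20% headroom for weights and the CUDA context, then size the batch.
batch_size = max(1, int(total_bytes * 0.8) // BYTES_PER_SAMPLE)
print("batch_size:", batch_size)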

valgrind output points to gfortran library

I run valgrind-3.10.0 to search for memory leaks in my Fortran program. I'm using gfortran-4.9.0 to compile on OS X 10.9.5. From what I can tell from the output below, the memory leak is in a gfortran library. Am I correct? If so, is there anything that I can do?
HEAP SUMMARY:
==30650== in use at exit: 25,727 bytes in 390 blocks
==30650== total heap usage: 34,130 allocs, 33,740 frees, 11,306,357 bytes allocated
==30650==
==30650== Searching for pointers to 390 not-freed blocks
==30650== Checked 9,113,592 bytes
==30650==
==30650== 72 (36 direct, 36 indirect) bytes in 1 blocks are definitely lost in loss record 52 of 84
==30650== at 0x47E1: malloc (vg_replace_malloc.c:300)
==30650== by 0x345AB0: __Balloc_D2A (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x345CF6: __i2b_D2A (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x34362E: __dtoa (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x36A8A9: __vfprintf (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x3912DA: __v2printf (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x376F66: _vsnprintf (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x376FC5: vsnprintf_l (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0x3674DC: snprintf (in /usr/lib/system/libsystem_c.dylib)
==30650== by 0xE2F6D: write_float (in /usr/local/gfortran/lib/libgfortran.3.dylib)
==30650== by 0xE53A4: _gfortrani_write_real (in /usr/local/gfortran/lib/libgfortran.3.dylib)
==30650== by 0x3FA9999999999999: ???
==30650==
==30650== LEAK SUMMARY:
==30650== definitely lost: 36 bytes in 1 blocks
==30650== indirectly lost: 36 bytes in 1 blocks
==30650== possibly lost: 0 bytes in 0 blocks
==30650== still reachable: 316 bytes in 7 blocks
==30650== suppressed: 25,339 bytes in 381 blocks
==30650== Reachable blocks (those to which a pointer was found) are not shown.
==30650== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==30650==
==30650== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 15 from 15)
--30650--
--30650-- used_suppression: 34 OSX109:6-Leak /usr/local/lib/valgrind/default.supp:797 suppressed: 13,656 bytes in 252 blocks
--30650-- used_suppression: 1 OSX109:1-Leak /usr/local/lib/valgrind/default.supp:747 suppressed: 2,064 bytes in 1 blocks
--30650-- used_suppression: 13 OSX109:7-Leak /usr/local/lib/valgrind/default.supp:808 suppressed: 7,181 bytes in 78 blocks
--30650-- used_suppression: 11 OSX109:10-Leak /usr/local/lib/valgrind/default.supp:839 suppressed: 1,669 bytes in 29 blocks
--30650-- used_suppression: 10 OSX109:9-Leak /usr/local/lib/valgrind/default.supp:829 suppressed: 609 bytes in 15 blocks
--30650-- used_suppression: 5 OSX109:5-Leak /usr/local/lib/valgrind/default.supp:787 suppressed: 144 bytes in 5 blocks
--30650-- used_suppression: 1 OSX109:3-Leak /usr/local/lib/valgrind/default.supp:765 suppressed: 16 bytes in 1 blocks
==30650==
==30650== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 15 from 15)
This could very well be a bug in the gfortran library.
Your best bet would be to reduce this to a self-contained test case and report it to the gfortran developers at fortran@gcc.gnu.org, or to submit a bug report at http://gcc.gnu.org/bugzilla.

Why is memset slow?

The spec for my CPU says it should get 5.336GB/s bandwidth to memory. To test this, I wrote a simple program that runs memset (or memcpy) on a big array and reports the timing. I'm seeing 3.8GB/s on memset and 1.9GB/s on memcpy. http://en.wikipedia.org/wiki/Intel_Core_(microarchitecture) says my Q9400 should be getting 5.336GB/s. What's wrong?
I've tried replacing memset or memcpy with assignment loops. I've googled around to try to learn about memory alignment. I've tried different compiler flags. I've spent an embarrassing number of hours on this. Thanks for any help you can provide!
I'm using Ubuntu 12.04 with libc-dev version 2.15-0ubuntu10.5 and kernel 3.8.0-37-generic
The code:
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

#define numBytes ((long)(1024*1024*1024))
#define numTransfers ((long)(8))

int main(int argc, char **argv)
{
    if (argc != 3) {
        printf("Usage: %s BLOCK_SIZE_IN_BYTES NUMBER_OF_BLOCKS_TO_TRANSFER\n", argv[0]);
        return -1;
    }
    char *__restrict__ source = (char *)malloc(numBytes);
    char *__restrict__ dest = (char *)malloc(numBytes);
    struct timespec start, end;
    long totalTimeMs;
    int i;

    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    for (i = 0; i < numTransfers; ++i)
        memset(source, 0, numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);
    totalTimeMs = (end.tv_nsec - start.tv_nsec) * .000001 + 1000 * (end.tv_sec - start.tv_sec);
    printf("memset %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s). ",
           numBytes, numTransfers, numBytes/1024.0/1024/1024*numTransfers,
           totalTimeMs, numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);

    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    for (i = 0; i < numTransfers; ++i)
        memcpy(dest, source, numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);
    totalTimeMs = (end.tv_nsec - start.tv_nsec) * .000001 + 1000 * (end.tv_sec - start.tv_sec);
    printf("memcpy %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s).\n",
           numBytes, numTransfers, numBytes/1024.0/1024/1024*numTransfers,
           totalTimeMs, numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);

    free(source);
    free(dest);
    return EXIT_SUCCESS;
}
Compile commands:
gcc -O3 -DNDEBUG -o memcpyStackOverflowNoParameters.c.o -c memcpyStackOverflowNoParameters.c
gcc -O3 -DNDEBUG memcpyStackOverflowNoParameters.c.o -o memcpy -rdynamic -lrt
Sample outputs:
memset 1073741824 bytes 8 times (8.00GB total) in 2214ms (3.880GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4466ms (1.923GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4557ms (1.885GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2222ms (3.866GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4433ms (1.938GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2216ms (3.876GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4521ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2217ms (3.875GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4520ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4430ms (1.939GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2226ms (3.859GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4444ms (1.933GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2225ms (3.861GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4485ms (1.915GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2620ms (3.279GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4855ms (1.769GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2535ms (3.389GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4870ms (1.764GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2423ms (3.545GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4905ms (1.751GB/s).
My hardware according to lshw:
product: OptiPlex 960 ()
vendor: Winbond Electronics
width: 64 bits
*-core
description: Motherboard
product: 0Y958C
vendor: Winbond Electronics
*-firmware
description: BIOS
capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
*-cpu
product: Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
physical id: 400
size: 2666MHz
width: 64 bits
clock: 1333MHz
capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
configuration: cores=4 enabledcores=4 threads=4
*-cache:0
description: L1 cache
physical id: 700
size: 256KiB
capacity: 256KiB
capabilities: internal write-back unified
*-cache:1
description: L2 cache
physical id: 701
size: 6MiB
capacity: 6MiB
capabilities: internal varies unified
*-memory
description: System Memory
physical id: 1000
slot: System board or motherboard
size: 4GiB
*-bank:0
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns)
product: CT51264AA667.M16FC
vendor: 7F7F7F7F7F9B0000
slot: DIMM_1
size: 4GiB
width: 64 bits
clock: 667MHz (1.5ns)
*-bank:1
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
*-bank:2
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
*-bank:3
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
Memory addresses are "virtualized": the addresses your program uses are translated to real addresses. This translation makes it possible to allocate what your program sees as contiguous memory from whatever pieces are handy at the time. Every general-purpose CPU does this. The translation requires a table lookup, which requires memory access. The CPU has caches for it, but long stretches of virtual addresses can easily blow its cache, the "TLB" ("translation lookaside buffer"). So every 4KB (2MB on a Linux system that's figured out what you're doing) the CPU stalls hunting up where to really send your memory traffic. Those stalls can take quite a bit of time.
You might try running two copies of your benchmark; it seems reasonable that the TLB misses won't coincide, and you'll get aggregate bandwidth much closer to your rated capacity.
(edit: um, you might want to replace your #defines with
size_t numBytes=atoi(argv[1]);
size_t numTransfers=atoi(argv[2]);
in the main body ...)
Edit: by the way: the bandwidth I saw (and reported in comments) from this test on my box was so far below rated capacity for my cpu that it got me investigating my own system. My box builder had put really crap memory in those slots. I've long since replaced them with a known-good brand, more than doubled the reported throughput, and very visibly improved the performance of my machine.
Last I checked, memcpy and memset were not optimized in GCC, and this was still true in 2012. See Agner Fog's Optimizing Software in C++, section 2.6 "Choice of function libraries", and Table 2.1; he compares several different compilers and OSes.
GCC has built-in functions for doing memcpy. Apparently, they are even worse than the memcpy in glibc. As far as I understand, the GCC developers and the glibc developers work independently. To get the glibc versions you need to compile with -fno-builtin. However, although glibc is (or at least was) better, it's still not optimal. To get the best results, use Agner Fog's asmlib. He has optimized memcpy, memset, and many other common functions in assembly to take advantage of SSE and AVX, among other optimizations.
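For instance, to rebuild the benchmark from the question so that it calls the C library's memcpy/memset rather than GCC's builtins, something like this should work (a sketch reusing the question's compile flags):
gcc -O3 -DNDEBUG -fno-builtin memcpyStackOverflowNoParameters.c -o memcpy -rdynamic -lrt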

dump valgrind data

I am using valgrind on a program which runs an infinite loop.
As memcheck displays the memory leaks only after the end of the program, and my program has an infinite loop, it will never end.
So is there any way I can force valgrind to dump its data from time to time?
Thanks
Have a look at the client requests feature of memcheck. You can probably use VALGRIND_DO_LEAK_CHECK or similar.
EDIT:
In response to the statement above that this doesn't work. Here is an example program which loops forever:
#include <valgrind/memcheck.h>
#include <unistd.h>
#include <cstdlib>

int main(int argc, char* argv[])
{
    while (true) {
        char* leaked = new char[1];   // intentionally leaked, one byte per iteration
        VALGRIND_DO_LEAK_CHECK;       // ask memcheck to run a leak search right now
        sleep(1);
    }
    return EXIT_SUCCESS;
}
When I run this in valgrind, I get an endless output of new leaks:
$ valgrind ./a.out
==16082== Memcheck, a memory error detector
==16082== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==16082== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==16082== Command: ./a.out
==16082==
==16082== LEAK SUMMARY:
==16082== definitely lost: 0 bytes in 0 blocks
==16082== indirectly lost: 0 bytes in 0 blocks
==16082== possibly lost: 0 bytes in 0 blocks
==16082== still reachable: 1 bytes in 1 blocks
==16082== suppressed: 0 bytes in 0 blocks
==16082== Reachable blocks (those to which a pointer was found) are not shown.
==16082== To see them, rerun with: --leak-check=full --show-reachable=yes
==16082==
==16082== 1 bytes in 1 blocks are definitely lost in loss record 2 of 2
==16082== at 0x4C2BF77: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==16082== by 0x4007EE: main (testme.cc:9)
==16082==
==16082== LEAK SUMMARY:
==16082== definitely lost: 1 bytes in 1 blocks
==16082== indirectly lost: 0 bytes in 0 blocks
==16082== possibly lost: 0 bytes in 0 blocks
==16082== still reachable: 1 bytes in 1 blocks
==16082== suppressed: 0 bytes in 0 blocks
==16082== Reachable blocks (those to which a pointer was found) are not shown.
==16082== To see them, rerun with: --leak-check=full --show-reachable=yes
==16082==
==16082== 2 bytes in 2 blocks are definitely lost in loss record 2 of 2
==16082== at 0x4C2BF77: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==16082== by 0x4007EE: main (testme.cc:9)
==16082==
==16082== LEAK SUMMARY:
==16082== definitely lost: 2 bytes in 2 blocks
==16082== indirectly lost: 0 bytes in 0 blocks
==16082== possibly lost: 0 bytes in 0 blocks
==16082== still reachable: 1 bytes in 1 blocks
==16082== suppressed: 0 bytes in 0 blocks
==16082== Reachable blocks (those to which a pointer was found) are not shown.
==16082== To see them, rerun with: --leak-check=full --show-reachable=yes
The program does not terminate.
With valgrind 3.7.0, you can trigger (among other things) a leak search from the shell, using vgdb.
See e.g. http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.monitor-commands
(you can do these monitor commands from gdb or from a shell command line, using vgdb).
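For example (a sketch; the manual page above has the full monitor-command syntax), start the program under memcheck in one shell:
valgrind --tool=memcheck --leak-check=full --vgdb=yes ./prog
and then trigger a leak search in the running process from a second shell:
vgdb leak_check full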
Use of VALGRIND_DO_LEAK_CHECK (acm's answer) works for me.
Remarks:
- The program has to be launched with valgrind (valgrind myProg ...)
- The valgrind-devel package has to be installed (to have <valgrind/memcheck.h>)

Valgrind not printing inline error messages, just heap and leak summaries

I'm using the command:
valgrind --tool=memcheck --leak-check=yes ./prog
When this runs with a test script, I get no inline error messages or warnings; I just get a heap summary and a leak summary.
Am I missing a flag or something?
==31420== HEAP SUMMARY:
==31420== in use at exit: 1,580 bytes in 10 blocks
==31420== total heap usage: 47 allocs, 37 frees, 7,132 bytes allocated
==31420==
==31420== 1,580 (1,440 direct, 140 indirect) bytes in 5 blocks are definitely lost in loss record 2 of 2
==31420== at 0x4C274A8: malloc (vg_replace_malloc.c:236)
==31420== by 0x400FD4: main (lab1.c:51)
==31420==
==31420== LEAK SUMMARY:
==31420== definitely lost: 1,440 bytes in 5 blocks
==31420== indirectly lost: 140 bytes in 5 blocks
==31420== possibly lost: 0 bytes in 0 blocks
==31420== still reachable: 0 bytes in 0 blocks
==31420== suppressed: 0 bytes in 0 blocks
==31420==
==31420== For counts of detected and suppressed errors, rerun with: -v
==31420== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
The last time I used valgrind (a few days ago) it would print out error messages as they occurred, in addition to the heap and leak summaries.
EDIT:
I tried --leak-check=full, same result.
The line mentioned in the heap summary (lab1.c:51) is:
temp_record = malloc(sizeof(struct server_record));
And I use this pointer pretty often in my code. That is what was so helpful about the valgrind error messages before: they would show me when I lost my pointer to this malloc, among other problems.