How does valgrind work? - valgrind

Can someone provide a quick, top-level explanation of how Valgrind works? For example: how does it know when memory is allocated and freed?

Valgrind basically runs your application in a "sandbox." While running in this sandbox, it is able to insert its own instructions to do advanced debugging and profiling.
From the manual:
Your program is then run on a synthetic CPU provided by the Valgrind core. As new code is executed for the first time, the core hands the code to the selected tool. The tool adds its own instrumentation code to this and hands the result back to the core, which coordinates the continued execution of this instrumented code.
So basically, valgrind provides a virtual processor that executes your application. However, before your application instructions are processed, they are passed to tools (such as memcheck). These tools are kind of like plugins, and they are able to modify your application before it is run on the processor.
The great thing about this approach is that you don't have to modify or relink your program at all to run it in valgrind. It does cause your program to run slower; however, valgrind isn't meant to measure performance or to run during normal execution of your application, so this isn't really an issue.
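As a rough illustration of the core/tool split described in the manual quote, here is a minimal Python sketch. Everything in it is hypothetical: the toy instruction set (LOAD, STORE, COUNT), the `Core` class, and `counting_tool` are invented for this example and are not Valgrind's actual API or intermediate representation.

```python
from typing import Callable, Dict, List, Tuple

Instr = Tuple[str, int]   # (opcode, operand) in a made-up toy instruction set
Tool = Callable[[List[Instr]], List[Instr]]


def counting_tool(block: List[Instr]) -> List[Instr]:
    """Hypothetical tool plugin: insert a COUNT marker before every STORE."""
    out: List[Instr] = []
    for op, arg in block:
        if op == "STORE":
            out.append(("COUNT", arg))   # instrumentation added by the tool
        out.append((op, arg))
    return out


class Core:
    """Toy 'synthetic CPU': instruments each block once, then runs the result."""

    def __init__(self, tool: Tool) -> None:
        self.tool = tool
        self.cache: Dict[int, List[Instr]] = {}   # block id -> instrumented code
        self.memory: Dict[int, int] = {}
        self.acc = 0
        self.store_counts: Dict[int, int] = {}

    def run_block(self, block_id: int, block: List[Instr]) -> None:
        if block_id not in self.cache:            # first execution: hand to tool
            self.cache[block_id] = self.tool(block)
        for op, arg in self.cache[block_id]:
            if op == "LOAD":
                self.acc = arg
            elif op == "STORE":
                self.memory[arg] = self.acc
            elif op == "COUNT":
                self.store_counts[arg] = self.store_counts.get(arg, 0) + 1


program = [("LOAD", 7), ("STORE", 100), ("LOAD", 9), ("STORE", 100)]
core = Core(counting_tool)
core.run_block(0, program)
core.run_block(0, program)   # second run reuses the cached instrumented block
```

The original `program` list is never modified; only the cached, instrumented copy is executed, which mirrors the point that the application itself does not need to be changed or relinked.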

Valgrind is a Dynamic Binary Analysis (DBA) tool that uses a Dynamic Binary Instrumentation (DBI) framework to check memory allocation, detect deadlocks, and profile applications. The DBI framework has its own low-level memory manager, scheduler, thread handler, and signal handler. The Valgrind tool suite includes tools such as:
Memcheck - tracks memory allocation dynamically and reports memory leaks.
Helgrind - detects and reports deadlocks, potential data races, and lock reversals.
Cachegrind - simulates how the application interacts with the system cache and provides information about cache misses.
Nulgrind - a minimal tool that never does any analysis. Used by developers for performance benchmarking.
Massif - a tool to analyse the heap memory usage of the application.
Valgrind uses a disassemble-and-resynthesize mechanism: it loads the application into a process, disassembles the application code, adds the instrumentation code for analysis, assembles it back, and executes the application. It uses a Just-In-Time (JIT) compiler to embed the instrumentation code into the application.
Valgrind Tool = Valgrind Core + Tool Plugin
Valgrind Core disassembles the application code and passes the code fragment to the tool plugin for instrumentation. The tool plugin adds the analysis code and hands the result back to the core, which assembles it back into machine code. Thus, Valgrind provides the flexibility to write your own tool on top of the Valgrind framework. Valgrind uses shadow registers and shadow memory to instrument read/write instructions, read/write system calls, and stack and heap allocations.
Valgrind provides wrappers around system calls and registers pre- and post-callbacks for every system call to track the memory accessed as part of the system call. Thus, Valgrind is an OS abstraction layer between the Linux operating system and the client application.
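To make the shadow-memory idea concrete, here is a small Python sketch of how a tool could answer the original question (how it knows when memory is allocated and freed). It is a simplified model under stated assumptions, not Memcheck's real data structures: the `ShadowHeap` class, the one-flag-per-address layout, the 16-byte gap, and the starting address are all invented for illustration.

```python
class ShadowHeap:
    """Toy model: one shadow flag per address records whether access is legal."""

    def __init__(self) -> None:
        self.next_addr = 0x1000   # arbitrary starting address for the toy heap
        self.sizes = {}           # start address of each live block -> size
        self.shadow = {}          # address -> True if currently addressable
        self.errors = []

    def malloc(self, size: int) -> int:
        addr = self.next_addr
        self.next_addr += size + 16          # leave an unaddressable gap
        self.sizes[addr] = size
        for a in range(addr, addr + size):
            self.shadow[a] = True            # mark the new bytes addressable
        return addr

    def free(self, addr: int) -> None:
        size = self.sizes.pop(addr, None)
        if size is None:
            self.errors.append(f"invalid free of {addr:#x}")
            return
        for a in range(addr, addr + size):
            self.shadow[a] = False           # freed bytes become off limits

    def check_access(self, addr: int) -> bool:
        ok = self.shadow.get(addr, False)
        if not ok:
            self.errors.append(f"invalid access at {addr:#x}")
        return ok

    def leaks(self) -> dict:
        """Blocks that were allocated but never freed."""
        return dict(self.sizes)


heap = ShadowHeap()
p = heap.malloc(8)
q = heap.malloc(4)
in_bounds = heap.check_access(p + 7)    # last byte of the block
overrun = heap.check_access(p + 8)      # one past the end
heap.free(p)
use_after_free = heap.check_access(p)
heap.free(p)                            # double free
```

In this model every allocation, free, and access goes through the tracker, which is the role the instrumentation plays for the real program; at exit, `heap.leaks()` lists what is still allocated.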
[Diagram: the 8 phases of Valgrind]

valgrind sits as a layer between your program and the OS, intercepting calls to the OS requesting memory (de)allocation and recording what is being manipulated before then actually allocating the memory and passing back an equivalent. It's essentially how most code profilers work, except at a much lower level (system calls instead of program function calls).

Here you can find some nice info:
Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation
http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.wrapping
Besides, familiarize yourself with LD_PRELOAD.

Valgrind is basically a virtual machine that executes your program. It is a virtual architecture that intercepts each call to allocate/free memory.

Related

Is it possible to set a baseline memory usage in valgrind for leak detection?

Is there a way to tell valgrind from inside my code when to start and when to stop checking for memory leaks?
I am using a legacy testing framework which must link with my testing program in order to run. The framework has memory leaks in it - valgrind shows about 50KB of memory that has not been released, but is reachable via heuristic. This is annoying, because I must keep this number in mind to see how much memory is leaked from my code. It would be a lot more convenient if I could tell valgrind to start collecting memory stats when my first test begins, and stop collecting when the last test is over. Is there an API for it?
Valgrind's Memcheck lets you do a "differential" leak search. The differential leak search reports the delta between the previous leak search and the current situation.
You can do such a differential leak search using monitor commands with vgdb, either from the shell or from gdb. See https://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.monitor-commands.
You can also use the client request VALGRIND_DO_CHANGED_LEAK_CHECK from your program, see https://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.clientreqs.
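The idea behind a differential leak search can be sketched in a few lines of Python. This is only a conceptual model, assuming a hypothetical `LeakTracker` class; it does not call Valgrind and does not reproduce Memcheck's report format. The real mechanism is the monitor command or client request named above.

```python
class LeakTracker:
    """Toy allocation tracker with a 'changed since last check' report."""

    def __init__(self) -> None:
        self.live = {}        # block id -> size of blocks not yet freed
        self.baseline = {}    # snapshot of live blocks at the previous check
        self.next_id = 0

    def alloc(self, size: int) -> int:
        block_id = self.next_id
        self.next_id += 1
        self.live[block_id] = size
        return block_id

    def free(self, block_id: int) -> None:
        del self.live[block_id]

    def changed_leak_check(self) -> int:
        """Return bytes still allocated that were not live at the previous check."""
        new_blocks = {b: s for b, s in self.live.items() if b not in self.baseline}
        self.baseline = dict(self.live)
        return sum(new_blocks.values())


tracker = LeakTracker()
tracker.alloc(50_000)                           # stands in for the framework's leak
framework_noise = tracker.changed_leak_check()  # establishes the baseline

temp = tracker.alloc(100)                       # test code: allocated and freed
tracker.alloc(24)                               # test code: never freed
tracker.free(temp)
my_leak = tracker.changed_leak_check()          # reports only the delta
```

Calling the check once before the first test and once after the last one keeps the framework's roughly 50KB out of the second report, which is the effect the question asks for.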

When to use full system FS vs syscall emulation SE with userland programs in gem5?

Since syscall emulation is much easier to set up, I'm wondering what the advantages of full system emulation are when running a userland program.
Or in other words, what interesting aspects are modeled in the full system but not syscall emulation mode, and when are they significant?
It is mentioned in the docs at: http://gem5.org/Splash_benchmarks that full system is
Realistic: you're getting the actual Linux thread scheduler to schedule your threads
Is this the only advantage, or are there other advantages for users who are optimizing their applications or investigating micro-architecture?
I also suspect that the MMU simulation is another important feature that is only modeled properly in full system mode, and could affect program performance.
Full system mode should be preferred (when it is possible to use it). There are benefits to using it, primarily fidelity in the simulation which is not possible with system call emulation mode. (The kernel interactions with an application can be important depending on the study that a researcher is trying to conduct.) Also, the user does not need to worry about implementing (or debugging) the system call implementation.
With that said, system call emulation mode can be useful under the right conditions. It is faster to run application code because there is no kernel running in the background. There is also no system noise if you want to mitigate it entirely. Arguably, it is easier to bootstrap a new device model as well. You can work on the model without driver support and make magic happen through fake interfaces. (It saves you having to model the bare-metal interface perfectly or having to write your own device driver.)
Your comments about dynamic linking and multi-threading support are related. If dynamic linking is fixed, you should be able to use your system's pthreads library and can forget about linking with m5threads entirely. The pthread library support has existed in the simulator for a while now (the system calls necessary for it to work properly).
However, there's a caveat to the threading implementation. You need to preallocate enough thread contexts at the start of simulation (by invoking with the -n option on the se.py script).
To elaborate, there is no operating system running in the background to schedule threads on the processors. (I use the terms threads and processors very loosely here.) To obviate the scheduling problem, you have to preallocate enough processors so that the threads can be created on calls to clone/execve. There is a constraint that you can never have more threads than processors (unlike a real system where the operating system can schedule them as it pleases).
The configuration scripts probably do not behave how a researcher would want them to behave for a multi-threaded workload. The researcher would need to verify that the caches were configured correctly and that they are sharing certain cache levels like a real machine would do. If the application calls clone/execve many times, it may not be possible to cause the generated configuration to behave realistically.
Your last statement about modeling accelerators is incorrect. The AMD GFX8 model does use system call emulation mode. (Also, we developed a NIC model which was never publicly released.) It involves creating a fake driver and manipulating it through the same ioctl interfaces that a real driver would use. Linux treats everything like a file so the driver is opened through the open system call interface and you can capture it there. There are other things which you might need to do (like map mmio ranges in the configuration), but the driver interface is the main piece. The application interacts with the driver and the driver interacts with the accelerator model.
Advantages of SE:
sometimes easier to set up benchmarks, if all syscalls you need are implemented, and if you have just the right cross compiler, which of course no one has documented properly which one that is.
SE runs Dhrystone about 2x faster: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/00d282d912173b72c63c0a2cc893a97d45498da5#user-mode-vs-full-system-benchmark That benchmark makes no syscalls (except for information before / after the actual benchmark runs).
it is easier to get greater visibility and control of what the application is doing since the kernel is not running in parallel. E.g. stats will be just for the application, GDB will be just for the application: thread-aware gdb for the Linux kernel
Disadvantages of SE:
in practice, harder to set up benchmarks, because it is too fragile / has too many restrictions.
If your content does not work immediately out of the box, it is easier to just create or download a full system image and go for that instead, which is much more reliable.
Here is a sample minimal working Ubuntu setup if you are still interested: How to compile and run an executable in gem5 syscall emulation mode with se.py?
less representative, since no actual OS is running
no dynamic linking for ARM as of June 2018: How to run a dynamically linked executable syscall emulation mode se.py in gem5?
if you want to evaluate an accelerator like a GPU, you will have to create some slightly custom interface for it, since there is no kernel driver running on top of the kernel as usual.
Brandon has pointed out in his answer that this has in fact been done before: https://stackoverflow.com/a/56371006/9160762
So my recommendation is:
try SE first. If it works, great. If it doesn't, try to fix it quickly, since most problems are trivial. Having the SE setup will save you a lot of time over full system, and it is often representative enough.
otherwise, use FS mode. It is just simpler to set up, more representative, and the performance hit is acceptable for most.
You could also use SE first, and then go to FS to further validate only your most important SE results, since FS is slower and you can therefore validate fewer different setups.

Instrumentation test run failed

I would like to ask a general question,
I am doing automation testing using the Robotium tool on a tablet with a single processor. While performing some actions, my test case fails with: INSTRUMENTATION TEST RUN FAILED due to java.lang.OutOfMemoryError.
What I need to know is whether the out-of-memory error also depends on the device's processor speed, or whether it depends purely on the app and test code.
Any solutions would help me a lot.
The OutOfMemoryError indicates that you've probably run out of heap space in the application. The device's kernel may set the limits on heap, but your problem is probably in your application and test code.
Does your test run out of memory while executing large tests?
You may want to profile your application for Memory Usage and start resolving memory leaks first.
It can also help if your Robotium tests don't run for extended periods of time, but that is only a band-aid if your application has memory leaks.

why are system calls handled using interrupts?

I have a basic question about the linux system call.
Why are system calls not handled just like normal function calls, and why are they handled via software interrupts?
Is it because, there is no linking process performed for user space application with kernel during the build process of user application?
Linking between separately compiled pieces of code is a minor problem. Shared libraries have had a workaround for it for quite some time (relocatable code, export tables, etc). You pay the cost typically just once, when you load the library in the program.
The bigger problem is that you need to switch the CPU from the unprivileged user mode into the privileged kernel mode, and you need to do it in a controllable way, without letting user code escape and wreak havoc on the kernel. That's typically done with special or designated instructions. You may also benefit from automatic interrupt disabling when transitioning into the kernel, which the x86 int instruction can do for you. Most CPUs have something like this instruction and it's a common way of implementing the system call interface, although not the only one.
If you asked about MS-DOS or the original MINIX, both of which ran on the i8086 in real address mode, where the kernel couldn't protect itself or other programs from anything because all the memory and system resources were accessible to all code, then there would be less reason to use a special instruction like int. There were not two modes, only one, and in that respect int would be largely equivalent to a simple far call.
Also noteworthy is the fact that CPUs often handle the following 3 types of events in a very similar fashion:
hardware interrupts from I/O devices
exceptions, errors from code execution (e.g. division by 0, page faults, etc)
system calls
That makes using something like the int instruction a natural choice as your entry and exit points in all of the above handlers would be if not fully then largely identical.
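The shared entry path for all three event types, together with the controlled privilege switch, can be sketched as a toy model in Python. The `ToyCPU` class, the handler names, and the return value 42 are invented for illustration; the vector numbers are only examples (0x80 is the one 32-bit Linux on x86 uses for `int 0x80`, the other two are arbitrary here).

```python
class ToyCPU:
    """Hypothetical CPU model: every event enters the kernel through trap()."""

    def __init__(self) -> None:
        self.mode = "kernel"      # start privileged, as at boot
        self._vectors = {}        # vector number -> kernel handler
        self.log = []

    def install_handler(self, vector: int, handler) -> None:
        if self.mode != "kernel":
            raise PermissionError("vector table is writable in kernel mode only")
        self._vectors[vector] = handler

    def trap(self, vector: int, *args):
        """Common entry and exit path for interrupts, exceptions, and syscalls."""
        saved = self.mode
        self.mode = "kernel"      # the only way user code reaches kernel mode
        try:
            return self._vectors[vector](self, *args)
        finally:
            self.mode = saved     # return to the interrupted mode


def divide_error(cpu):
    cpu.log.append(("exception", cpu.mode))


def device_irq(cpu):
    cpu.log.append(("interrupt", cpu.mode))


def sys_getpid(cpu):
    cpu.log.append(("syscall", cpu.mode))
    return 42                     # made-up process id


cpu = ToyCPU()
cpu.install_handler(0x00, divide_error)
cpu.install_handler(0x21, device_irq)
cpu.install_handler(0x80, sys_getpid)
cpu.mode = "user"                 # drop privileges before running user code

pid = cpu.trap(0x80)              # system call
cpu.trap(0x00)                    # exception
cpu.trap(0x21)                    # hardware interrupt
```

User code can choose which vector to raise but not where it lands: the table can only be changed in kernel mode, so control always enters the kernel at a handler the kernel picked.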

How does one use dynamic recompilation?

It came to my attention that some emulators and virtual machines use dynamic recompilation. How do they do that? In C I know how to call a function in RAM using typecasting (although I never tried), but how does one read opcodes and generate code for them? Does the person need to have premade assembly chunks and copy/batch them together? Is the assembly written in C? If so, how do you find the length of the code? How do you account for system interrupts?
-edit-
System interrupts and how to (re)compile the data are what I am most interested in. Upon more research, I heard of one person (no source available) who used JS: read the machine code, output JS source, and use eval to 'compile' the JS source. Interesting.
It sounds like I MUST have knowledge of the target platform machine code to dynamically recompile
Yes, absolutely. That is why parts of the Java Virtual Machine must be rewritten (namely, the JIT) for every architecture.
When you write a virtual machine, you have a particular host-architecture in mind, and a particular guest-architecture. A portable VM is better called an emulator, since you would be emulating every instruction of the guest-architecture (guest-registers would be represented as host-variables, rather than host-registers).
When the guest- and host-architectures are the same, as with VMware, there are a ton of (pretty neat) optimizations you can do to speed up the virtualization - today we are at the point that this type of virtual machine is BARELY slower than running directly on the processor. Of course, it is extremely architecture-dependent - you would probably be better off rewriting most of VMware from scratch than trying to port it.
It's quite possible - though obviously not trivial - to disassemble code from a memory pointer, optimize the code in some way, and then write back the optimized code - either to the original location or to a new location with a jump patched into the original location.
Of course, emulators and VMs don't have to RE-write, they can do this at load-time.
This is a wide open question, not sure where you want to go with it. Wikipedia covers the generic topic with a generic answer. The native code being emulated or virtualized is replaced with native code. The more the code is run the more is replaced.
I think you need to do a few things. First, decide whether you are talking about an emulation or a virtual machine like VMware or VirtualBox. In an emulation, the processor and hardware are emulated in software: the emulator reads the next instruction, pulls the opcode apart, and determines what to do with it. I have been doing some 6502 emulation and static binary translation, which is dynamic recompilation but preprocessed instead of done in real time.
So your emulator may take an LDA #10 (load A with immediate). The emulator sees the load A immediate instruction and knows it has to read the next byte, which is the immediate. The emulator has a variable in its code for the A register and puts the immediate value in that variable. Before completing the instruction, the emulator needs to update the flags: in this case the Z (zero) flag is clear, the N flag is clear, and C and V are untouched. But what if the next instruction is a load X immediate? No big deal, right? Well, the load X will also modify the Z and N flags, so the next time you execute the load A instruction you may figure out that you don't have to compute its flags, because they will be destroyed: that computation is dead code in the emulation.
You can continue with this kind of thinking. Say you see code that copies the X register to the A register, pushes A on the stack, then copies the Y register to A and pushes that on the stack; you could replace that chunk with simply pushing the X and Y registers on the stack. Or you may see a couple of add-with-carry instructions chained together to perform a 16-bit add and store the result in adjacent memory locations. Basically, you are looking for operations that the processor being emulated couldn't do but that are easy to do in the emulation.
Static binary translation, which I suggest you look into before dynamic recompilation, performs this analysis and translation statically, that is, before you run the code. Instead of emulating, you translate the opcodes to C, for example, and remove as much dead code as you can (a nice feature is that the C compiler can remove more dead code for you).
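The LDA/LDX example above can be turned into a tiny static translator. This Python sketch is an assumption-laden illustration, not the answerer's code: it handles only the three 6502 immediate loads (opcodes 0xA9, 0xA2, 0xA0), emits Python source rather than C, and uses a hypothetical `cpu` dictionary for the registers and flags. The dead-flag rule is deliberately crude: within this subset every instruction overwrites Z and N, so only the last load in a block needs to compute them.

```python
LOADS = {0xA9: "a", 0xA2: "x", 0xA0: "y"}   # LDA, LDX, LDY immediate opcodes


def decode(code: bytes):
    """Split raw bytes into (register, immediate) pairs."""
    pc, out = 0, []
    while pc < len(code):
        reg = LOADS[code[pc]]               # KeyError for opcodes outside this sketch
        out.append((reg, code[pc + 1]))
        pc += 2
    return out


def translate(code: bytes) -> str:
    """Emit Python source for the block, skipping flag updates that are dead."""
    instrs = decode(code)
    lines = ["def block(cpu):"]
    for i, (reg, imm) in enumerate(instrs):
        lines.append(f"    cpu['{reg}'] = {imm}")
        flags_dead = i + 1 < len(instrs)    # the next load overwrites Z and N
        if not flags_dead:
            lines.append(f"    cpu['z'] = {int(imm == 0)}")
            lines.append(f"    cpu['n'] = {imm >> 7}")
    lines.append("    return cpu")
    return "\n".join(lines)


source = translate(bytes([0xA9, 0x0A, 0xA2, 0x80]))   # LDA #10 ; LDX #$80
namespace = {}
exec(source, namespace)                    # 'compile' the generated source
cpu = {"a": 0, "x": 0, "y": 0, "z": 0, "n": 0}
namespace["block"](cpu)
```

Because the immediates are known at translation time, the flag values are folded into constants, and the flag update for LDA #10 never appears in the generated code at all. Doing the same thing at run time, one block at a time as code is first reached, is the step from static translation to dynamic recompilation.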
Once the concepts of emulation and translation are understood, you can try to do it dynamically; it is certainly not trivial. I would suggest again trying a static translation of a binary to the machine code of the target processor, which is a good exercise. I wouldn't attempt dynamic run-time optimizations until I had succeeded in performing them statically against a/the binary.
Virtualization is a different story: you are talking about running the same processor on the same processor, so x86 on an x86, for example. The beauty here is that, on non-old x86 processors, you can take the program being virtualized and run the actual opcodes on the actual processor, with no emulation. You set up traps built into the processor to catch things. Loading values into AX, adding BX, and so on all happen at real speed on the processor. When the program reads or writes memory, what happens depends on your trap mechanism: if the address is within the virtual machine's RAM space, there is no trap. But let's say the program writes to an address that is the virtualized UART. The processor traps that, and then VMware or whatever decodes that write and emulates it, talking to a real serial port. That one instruction, though, was not real time; it took quite a while to execute. What you could do, if you chose to, is replace the instruction or set of instructions that write a value to the virtualized serial port and have them write to a different address, which could be the real serial port or some other location that will not cause a fault that forces the VM manager to emulate the instruction. Or add some code in the virtual memory space that performs a write to the UART without a trap, and have the original code branch to this UART write routine instead. The next time you hit that chunk of code, it runs at real time.
Another thing you can do is, for example, emulate and, as you go, translate to a virtual intermediate bytecode, like LLVM's. From there you can translate from the intermediate machine to the native machine, eventually replacing large sections of the program if not the whole thing. You still have to deal with the peripherals and I/O.
Here's an explanation of how they are doing dynamic recompilation for the 'Rubinius' Ruby interpreter:
http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/
This approach is typically used by environments with an intermediate byte code representation (like Java, .NET). The byte code contains enough "high level" structures (high level in terms of higher level than machine code) that the VM can take chunks out of the byte code and replace them with a compiled memory block. The VM typically decides which parts get compiled by counting how many times the code has already been interpreted, since the compilation itself is a complex and time-consuming process. So it is useful to compile only the parts which get executed many times.
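The counting strategy can be sketched as follows. This is a minimal Python model, not how any particular JVM or .NET runtime is implemented: the two-opcode bytecode, the `TinyVM` class, and the threshold of 3 are made up, and "compiling" here means building a Python lambda rather than emitting machine code.

```python
HOT_THRESHOLD = 3   # hypothetical: compile after this many interpreted runs


def interpret(bytecode, x):
    """Slow path: walk the toy bytecode one operation at a time."""
    for op, arg in bytecode:
        if op == "add":
            x += arg
        elif op == "mul":
            x *= arg
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return x


def compile_to_python(bytecode):
    """'JIT': fold the whole bytecode sequence into a single expression."""
    expr = "x"
    for op, arg in bytecode:
        symbol = {"add": "+", "mul": "*"}[op]
        expr = f"({expr} {symbol} {arg})"
    return eval(f"lambda x: {expr}")


class TinyVM:
    def __init__(self, bytecode) -> None:
        self.bytecode = bytecode
        self.calls = 0          # how many times the code was interpreted
        self.compiled = None    # filled in once the code is considered hot

    def run(self, x):
        if self.compiled is not None:
            return self.compiled(x)              # fast path
        self.calls += 1
        if self.calls >= HOT_THRESHOLD:
            self.compiled = compile_to_python(self.bytecode)
        return interpret(self.bytecode, x)


vm = TinyVM([("add", 2), ("mul", 3)])
results = [vm.run(n) for n in range(5)]   # first 3 interpreted, last 2 compiled
```

Code that runs only once or twice never pays the compilation cost, which is the trade-off the paragraph above describes.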
but how does one read opcodes and generate code for it?
The scheme of the opcodes is defined by the specification of the VM, so the VM opens the program file, and interprets it according to the spec.
Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C?
This process is an implementation detail of the VM. Typically there is an embedded compiler that is capable of transforming the VM opcode stream into machine code.
How do you account for system interrupts?
Very simple: none. The code in the VM can't interact with real hardware. The VM interacts with the OS and transfers OS events to the code by jumping to or calling specific parts inside the interpreted code. Every event in the code or from the OS must pass through the VM.
Hardware virtualization products can also use some kind of JIT. A typical use case in the x86 world is translating 16-bit real-mode code to 32- or 64-bit protected-mode code, so that the CPU does not have to be emulated in real mode. Also, a software-only VM replaces jump instructions in the executing code with jumps into the VM control software, which, at each branch, scans the following code path for jump instructions and replaces them before it jumps to the real code destination. But I doubt that the jump replacement qualifies as JIT compilation.
IIS does this by shadow copying: after compilation it copies assemblies to some temporary place and runs them from temp.
Imagine that a user changes some files. Then IIS recompiles the assemblies in these steps:
Recompiles (all requests handled by old code)
Copies new assemblies (all requests handled by old code)
All new requests are handled by the new code; requests already in progress finish on the old code.
I hope this'd be helpful.
A virtual machine loads "byte code" or "intermediate language", not machine code; therefore, I suppose that it just recompiles the byte code more efficiently once it has more runtime data.
http://en.wikipedia.org/wiki/Just-in-time_compilation