Limits of Klee (the LLVM program analysis tool)

http://klee.llvm.org/ is a program analysis tool that works by symbolic execution and constraint solving, finding possible inputs that will cause a program to crash, and outputting these as test cases. It's an extremely impressive piece of engineering that has produced some good results so far, including finding a number of bugs in a collection of open-source implementations of Unix utilities that had been considered among some of the most thoroughly tested software ever written.
My question is: what does it not do?
Of course, any such tool has the inherent limit that it can't read the user's mind and guess what the output was supposed to be. But leaving aside the in principle impossible, most projects don't yet seem to be using Klee; what are the limitations of the current version, what sort of bugs and workloads can it as yet not handle?

From what I can tell after reading the paper at http://llvm.org/pubs/2008-12-OSDI-KLEE.html:
It can't check all possible paths of a big program. It failed even on the sort utility. The underlying problem is the halting problem (an undecidable problem), and KLEE is a heuristic, so it can check only some of the paths in a limited time.
It isn't fast: according to the paper, it needed 89 hours to generate tests for 141,000 lines of code in COREUTILS (including the libc code they use), and the largest single program has only ~10,000 lines.
It knows nothing about floating point, longjmp/setjmp, threads, asm, or memory objects of variable size.
Update: /from llvm blog/ http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html
5 . The LLVM "Klee" Subproject uses symbolic analysis to "try every possible path" through a piece of code to find bugs in the code and it produces a testcase. It is a great little project that is mostly limited by not being practical to run on large-scale applications.
Update2: KLEE requires the program to be modified. http://t1.minormatter.com/~ddunbar/klee-doxygen/overview.html
Symbolic memory is defined by inserting special calls to KLEE (namely klee_make_symbolic). During execution, KLEE tracks all uses of symbolic memory.

Overall, KLEE should be a pretty good symbolic execution engine for academic research. For practical use, it could be limited by the following aspects:
[1] The memory model used by the LLVM IR interpreter in KLEE is costly in both memory and time. For each execution path, KLEE maintains a private "address space". The address space maintains a "stack" for local variables and a "heap" for global variables and dynamically allocated variables. However, each variable (local or global) is wrapped in a MemoryObject object (MemoryObject maintains metadata related to the variable, such as its starting address, size, and allocation information). The memory used for each variable is therefore the size of the original variable plus the size of the MemoryObject. When a variable is to be accessed, KLEE first searches the "address space" to locate the corresponding MemoryObject, then checks the MemoryObject to see whether the access is legitimate. If so, the access is completed and the state of the MemoryObject is updated. Such a memory access is obviously slower than a direct RAM access. This design easily supports the propagation of symbolic values. However, the model could be simplified by borrowing from taint analysis (labeling the symbolic status of variables).
[2] KLEE lacks models for system environments. The only system component modeled in KLEE is a simple file system. Others, such as sockets or multi-threading, are not supported. When a program invokes system calls to these non-modeled components, KLEE either (1) terminates the execution and raises an alert or (2) redirects the call to the underlying OS (problems: symbolic values need to be concretized, so some paths would be missed; system calls from different execution paths would interfere with each other). I suppose this is the reason for it "knowing nothing about threads", as mentioned above.
[3] KLEE cannot directly work on binaries. KLEE requires the LLVM IR of the program under test, while other symbolic execution tools, such as S2E and VINE from the BitBlaze project, can work on binaries. Interestingly, the S2E project relies on KLEE as its symbolic execution engine.
Regarding the above answer, I personally have different opinions. First, it's true that KLEE cannot work perfectly with large programs, but which symbolic execution tool can? Path explosion is more a theoretical open problem than an engineering problem. Second, as I mentioned, KLEE might run slowly due to its memory model. However, KLEE does not slow down the execution for nothing. It conservatively checks for memory corruption (such as buffer overflows) and logs a set of useful information for each executed path (such as the constraints on the inputs needed to follow a path). In addition, I do not know of other symbolic execution tools that run super fast. Third, "floating point, longjmp/setjmp, threads, asm; memory objects of variable size" are just engineering work. For example, the authors of KLEE actually did some work to enable KLEE to support floating point (http://srg.doc.ic.ac.uk/files/papers/kleefp-eurosys-11.pdf). Fourth, KLEE does not necessarily require instrumenting the program to label symbolic variables. As mentioned above, symbolic values can be fed into the program via the command line. In fact, users can also specify files to be symbolic. If required, users can simply instrument the library functions to make inputs symbolic (once and for all).
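To make the per-access cost described in [1] concrete, here is a minimal sketch of an "address space" that resolves every access to a wrapping object and bounds-checks it. It is a toy in Python, not KLEE's actual implementation; the class names mirror the terms used above, but every method and field here is an assumption made for illustration.

```python
import bisect


class OutOfBoundsAccess(Exception):
    """Raised when an address does not fall inside any known object."""


class MemoryObject:
    """Hypothetical wrapper: metadata (base, size) plus the bytes themselves."""

    def __init__(self, base, size):
        self.base = base
        self.size = size
        self.data = bytearray(size)


class AddressSpace:
    """Toy per-path address space: objects kept sorted by base address."""

    def __init__(self):
        self._bases = []    # sorted base addresses
        self._objects = []  # MemoryObject instances, parallel to _bases

    def allocate(self, base, size):
        index = bisect.bisect_left(self._bases, base)
        obj = MemoryObject(base, size)
        self._bases.insert(index, base)
        self._objects.insert(index, obj)
        return obj

    def resolve(self, addr):
        # Find the last object whose base is <= addr, then bounds-check it.
        index = bisect.bisect_right(self._bases, addr) - 1
        if index >= 0:
            obj = self._objects[index]
            if addr < obj.base + obj.size:
                return obj, addr - obj.base
        raise OutOfBoundsAccess(hex(addr))

    def read(self, addr):
        obj, offset = self.resolve(addr)
        return obj.data[offset]

    def write(self, addr, value):
        obj, offset = self.resolve(addr)
        obj.data[offset] = value & 0xFF
```

Every read or write pays for a search plus a bounds check, which is where the slowdown comes from, and the same check is what lets an out-of-bounds access be reported instead of silently corrupting memory.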

Related

Memory/Address Sanitizer vs Valgrind

I want a tool to diagnose use-after-free bugs and uninitialized-memory bugs. I am considering the sanitizers (Memory and/or Address) and Valgrind, but I have very little idea about their advantages and disadvantages. Can anyone tell me the main features, differences and pros/cons of the sanitizers and Valgrind?
Edit: I found some comparisons, like: Valgrind uses DBI (dynamic binary instrumentation) and the sanitizers use CTI (compile-time instrumentation). Valgrind makes the program much slower (20x), whereas the sanitizers run much faster than Valgrind (2x). If anyone can give me some more important points to consider, it will be a great help.
I think you'll find this wiki useful.
TLDR main advantages of sanitizers are
much smaller CPU overheads (Lsan is practically free, UBsan/Isan is 1.25x, Asan and Msan are 2-4x for computationally intensive tasks and 1.05-1.1x for GUIs, Tsan is 5-15x)
wider class of detected errors (stack and global overflows, use-after-return/scope)
full support of multi-threaded apps (Valgrind support for multi-threading is a joke)
much smaller memory overhead (up to 2x for Asan, up to 3x for Msan, up to 10x for Tsan which is way better than Valgrind)
Disadvantages are
more complicated integration (you need to teach your build system to understand Asan and sometimes work around limitations/bugs in Asan itself; you also need to use a relatively recent compiler)
MemorySanitizer is not reall^W easily usable at the moment as it requires one to rebuild all dependencies under Msan (including all standard libraries e.g. libc++); this means that casual users can only use Valgrind for detecting uninitialized errors
sanitizers typically can not be combined with each other (the only supported combination is Asan+UBsan+Lsan) which means that you'll have to do separate QA runs to catch all types of bugs
One big difference is that the LLVM-included memory and thread sanitizers implicitly map huge swathes of address space (e.g., by calling mmap(X, Y, 0, MAP_NORESERVE|MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0) across terabytes of address space in the x86_64 environment). Even though they don't necessarily allocate that memory, the mapping can play havoc with restrictive environments (e.g., ones with reasonable settings for ulimit values).

What are common structures for firmware files?

I'm a total n00b in embedded programming. Suppose I'm building firmware using a compiler. The result of this operation is a file that will be flashed into (I guess) the flash memory of an MCU such as an ARM or an AVR.
My question is: What common structures (if any) are used for such generated files containing the firmware?
I come from desktop development and I understand that, for example, for Windows the compiler will most likely generate a PE or PE+, while on Unix-like systems I may get an ELF or COFF, but I have no idea for embedded systems.
I also understand that this highly depends on many factors (Compiler, ISA, MCU vendor, OS, etc.) so I'm fine with at least one example.
Update: I will up vote all answers providing examples of used structures and will select the one I feel best surveys the state of the art.
The firmware file is the Executable and Linkable Format (ELF) file, usually processed into a binary (.bin) or a text-represented binary (.hex).
This binary file is the exact memory image that is written to the embedded flash. When you first power the board, an internal bootloader will redirect the execution to your firmware entry point, normally at the address 0x0.
From there, it is your code that is running. This is why you have startup code (usually a startup.s file) that will configure the clock, the stack pointer registers and the vector table, load the data section into RAM (your initialized variables), clear the zero-initialized section, and so on. You may also want to copy your code to RAM and jump to the copy to avoid running code from FLASH (this can be faster on some platforms).
When running on an Operating System, all these platform choices and resources are not under the control of user code; there you can only link to the OS libraries and use the provided API to do low-level actions. In embedded, it is 100% user code: you access the hardware and manage its resources.
Not surprisingly, Operating Systems are booted in a similar manner to firmware, since both are directly in touch with the processor, memory and I/Os.
All of that is to say: the structure of a firmware is similar to the structure of any compiled program. There are data sections and code sections that are organized in memory during the load by the Operating System, or by the program itself when running on embedded.
One main difference is the memory addressing in the firmware binary: addresses are usually physical addresses, since most micro-controllers have no memory-mapping feature. This is transparent to the user; the compiler will abstract it.
Another significant difference is the stack pointer. On an OS, user code will not reserve memory for the stack by itself; it relies on the OS for that. In firmware, you have to do it in user code for the same reason as before: there's no middle man to manage it for you. The linker script will reserve Stack and Heap memory as configured, and there will be a stack_pointer symbol in your .map file letting you know where it points. You won't find it in the map files of programs built for an OS.
Most tools output either an ELF, or a COFF, or something similar that can eventually boil down to a HEX/bin file.
That isn't necessarily what your target wants to see, however. Every vendor has their own format of "firmware" files. Sometimes they're encrypted and signed, sometimes plain text. Sometimes there's compression, sometimes it's raw. It might be a simple file, or something complex that is more than just your program.
An integral part of doing embedded work is the build flow and system startup/booting procedure, plus getting your code onto the part. Don't underestimate the effort.
Ultimately the data written to the ROM is normally just the code and constant data from which your application is composed, and therefore has no structure other than perhaps being segmented into code and data, and possibly custom segments if you have created them. The structure in this sense is defined by the linker script or configuration used to build the code. The file containing this code/data may be raw binary, or an encoded binary format such as Intel Hex or Motorola S-Record for example.
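As an illustration of what an "encoded binary format" looks like, here is a minimal sketch that decodes a single Intel HEX record. Each record is a colon followed by hex digits for a byte count, a 16-bit address, a record type, the data bytes and a checksum chosen so that all bytes sum to zero modulo 256. The function name is made up for this example, and only the per-record framing is handled (no extended addressing).

```python
def parse_ihex_record(line):
    """Decode one Intel HEX record into (address, record_type, data)."""
    line = line.strip()
    if not line.startswith(":"):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(line[1:])
    if len(raw) < 5:
        raise ValueError("record too short")
    count = raw[0]                       # number of data bytes
    if len(raw) != count + 5:            # count + addr(2) + type + checksum
        raise ValueError("length field does not match record size")
    if sum(raw) & 0xFF != 0:             # checksum makes the byte sum 0 mod 256
        raise ValueError("bad checksum")
    address = (raw[1] << 8) | raw[2]     # big-endian 16-bit load offset
    record_type = raw[3]                 # 0 = data, 1 = end of file
    data = raw[4:4 + count]
    return address, record_type, data


if __name__ == "__main__":
    addr, rtype, data = parse_ihex_record(
        ":10010000214601360121470136007EFE09D2190140")
    print(hex(addr), rtype, data.hex())
```

A raw .bin file has none of this framing: it is just the data bytes laid out in address order, which is why the load address has to be supplied separately when flashing it.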
Typically also your toolchain will generate an object code file that contains not only the code/data, but also symbolic and debug information for use by a debugger. In this case when the debugger runs, it loads the code to the target (as in the binary file case above) and the symbol/debug information to the host to allow source level debugging. These files may be in a proprietary object file format specific to the toolchain, but are often standard "open" formats such as ELF. However strictly the meta-data components of an object file are not part of the firmware since they are not loaded on the target.
I've recently run across another firmware format not listed here. It's a binary format, might be called ".EEP" but might not. I think it is used by NXP. I've seen it used for ARM THUMB2 and for mystery stuff that may be a DSP/BSP.
All the following are 32-bit values, all stored in reverse endian except for CAFEBABE (so... BEBAFECA?):
CAFEBABE
length_in_16_bit_words(yes, 16-bit...?!)
base_addr
CRC32
length*2 bytes of data
FFFF (optional filler if the length is an odd number)
If there are more data blocks:
length
base
checksum that is not a CRC but something bizarre
data
FFFF (optional filler if the length is an odd number)
...
When no more data blocks remain:
length == 0
base == 0
checksum that is not a CRC but something bizarre
Then all of that is repeated for another memory bank/device.

What decides which structure a process has in memory?

I've learned that a process has the following structure in memory:
(Image from Operating System Concepts, page 82)
However, it is not clear to me what decides that a process looks like this. I guess processes could (and do?) look different if you have a look at non-standard OS / architectures.
Is this structure decided by the OS? By the compiler of the program? By the computer architecture? A combination of those?
Related and possible duplicate: Why do stacks typically grow downwards?.
On some ISAs (like x86), a downward-growing stack is baked in. (e.g. call decrements SP/ESP/RSP before pushing a return address, and exceptions / interrupts push a return context onto the stack so even if you wrote inefficient code that avoided the call instruction, you can't escape hardware usage of at least the kernel stack, although user-space stacks can do whatever you want.)
On others (like MIPS where there's no implicit stack usage), it's a software convention.
The rest of the layout follows from that: you want as much room as possible for downward stack growth and/or upward heap growth before they collide. (Or allowing you to set larger limits on their growth.)
Depending on the OS and executable file format, the linker may get to choose the layout, like whether text is above or below BSS and read-write data. The OS's program loader must respect where the linker asks for sections to be loaded (at least relative to each other, for executables that support ASLR of their static code/data/BSS). Normally such executables use PC-relative addressing to access static data, so ASLRing the text relative to the data or bss would require runtime fixups (and isn't done).
Or position-dependent executables have all their segments loaded at fixed (virtual) addresses, with only the stack address randomized.
The "heap" isn't normally a real thing, especially in systems with virtual memory so each process can have their own private virtual address space. Normally you have some space reserved for the stack, and everything outside that which isn't already mapped is fair game for malloc (actually its underlying mmap(MAP_ANONYMOUS) system calls) to choose when allocating new pages. But yes even modern glibc's malloc on modern Linux does still use brk() to move the "program break" upward for small allocations, increasing the size of "the heap" the way your diagram shows.
I think this is recommended by some committee, and then tools like GCC conform to that recommendation. The binary format defines these segments, and the operating system and its tools make it possible for programs in that format to run on the system. Let's say ELF is recommended by System V and then adopted by Unix, and GCC produces ELF binaries to be run on Unix. So I feel the story may start from the binary format, as it decides the memory mappings (code, data/heap/stack) used for loading programs. For example, ELF defines segments (arranging code into text, data and stack to be loaded in memory), GCC generates those segments of the ELF binary, and the loader loads them. The operating system also has freedom in adjusting the values of these segments, like the stack size. These are debatable thoughts which I am thinking out loud and trying to consolidate.
That figure represents either a specific implementation or an idealized one. A process does not necessarily have that structure. On many systems a process looks only somewhat similar to what is in the diagram.

How does machine code communicate with processor?

Let's take Python as an example. If I am not mistaken, when you program in it, the computer first "translates" the code to C. Then again, from C to assembly. Assembly is written in machine code. (This is just a vague idea that I have about this so correct me if I am wrong) But what's machine code written in, or, more exactly, how does the processor process its instructions, how does it "find out" what to do?
If I am not mistaken, when you program in it, the computer first "translates" the code to C.
No it doesn't. C is nothing special except that it's the most widespread programming language used for system programming.
The Python interpreter translates the Python code into so called P-Code that's executed by a virtual machine. This virtual machine is the actual interpreter which reads P-Code and every blip of P-Code makes the interpreter execute a predefined codepath. This is not very unlike how native binary machine code controls a CPU. A more modern approach is to translate the P-Code into native machine code.
The CPython interpreter itself is written in C and has been compiled into a native binary. Basically a native binary is just a long series of numbers (opcodes) where each number designates a certain operation. Some opcodes tell the machine that a defined count of numbers following it are not opcodes but parameters.
The CPU itself contains a so-called instruction decoder, which reads the native binary number by number, and for each opcode it reads it gives power to the circuit of the CPU that implements this particular opcode. There are opcodes that address memory, opcodes that load data from memory into registers, and so on.
how does the processor process its instructions, how does it "find out" what to do?
For every opcode, which is just a binary pattern, there is its own circuit on the CPU. If the pattern of the opcode matches the "switch" that enables this opcode, its circuit processes it.
Here's a WikiBook about it:
http://en.wikibooks.org/wiki/Microprocessor_Design
A few years ago some guy built a whole, working computer from simple function logic and memory ICs, i.e. no microcontroller or similar involved. The whole project called "Big Mess o' Wires" was more or less a CPU built from scratch. The only thing nerdier would have been building that thing from single transistors (which actually wasn't that much more difficult). He also provides a simulator which allows you to see how the CPU works internally, decoding each instruction and executing it: Big Mess o' Wires Simulator
EDIT: Since I originally wrote that answer, building a fully fledged CPU from modern, discrete components has been done. For your consideration, a MOS6502 (the CPU that powered the Apple II, Commodore C64, NES, BBC Micro and many more) built from discretes: https://monster6502.com/
Machine-code does not "communicate with the processor".
Rather, the processor "knows how to evaluate" machine-code. In the [widespread] Von Neumann architecture this machine-code (program) can be thought of as an index-able array where each cell contains a machine-code instruction (or data, but let's ignore that for now).
The CPU "looks" at the current instruction (often identified by the PC or Program Counter) and decides what to do (this can either be done directly with transistors/"bare-metal", or it can be translated to even lower-level code): this is known as the fetch-decode-execute cycle.
When the instructions are executed side-effects occur such as setting a control flag, putting a value in a register, or jumping to a different index (changing the PC) in the program, etc. See this simple overview of a CPU which covers the above a little bit better.
It is the evaluation of each instruction -- as it is encountered -- and the interaction of side-effects that results in the operation of a traditional processor.
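The fetch-decode-execute cycle described above can be sketched in software. This is a toy accumulator machine invented for the example: the opcode numbers, the two-byte instruction layout and the function name are all assumptions, not any real instruction set.

```python
# Hypothetical opcodes for a made-up 8-bit accumulator machine.
HALT, LOAD, ADD, SUB, JNZ = 0x00, 0x01, 0x02, 0x03, 0x04


def run(program, max_steps=1000):
    """Execute `program` (a list of byte values) and return the accumulator."""
    pc = 0    # program counter: index of the current instruction
    acc = 0   # the single register of this toy machine
    for _ in range(max_steps):
        opcode = program[pc]              # fetch
        if opcode == HALT:                # decode + execute
            return acc
        operand = program[pc + 1]         # every other opcode has one operand
        pc += 2                           # default: fall through to next one
        if opcode == LOAD:
            acc = operand
        elif opcode == ADD:
            acc = (acc + operand) & 0xFF
        elif opcode == SUB:
            acc = (acc - operand) & 0xFF
        elif opcode == JNZ:
            if acc != 0:
                pc = operand              # side effect: change the PC
        else:
            raise ValueError(f"unknown opcode {opcode:#x} at {pc - 2}")
    raise RuntimeError("step limit reached")


if __name__ == "__main__":
    print(run([LOAD, 5, ADD, 7, HALT]))           # straight-line code
    print(run([LOAD, 3, SUB, 1, JNZ, 2, HALT]))   # loop counting down to zero
```

In hardware the if/elif chain is not a sequence of comparisons: the opcode bits select the circuit directly. The array-plus-program-counter picture is the same, though.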
(Of course, modern CPUs are very complex and do lots of neat tricky things!)
That's called microcode. It's the code in the CPU that reads machine code instructions and translates them into low-level data flow.
When the CPU, for example, encounters the add instruction, the microcode describes how it should get the two values, feed them to the ALU to do the calculation, and where to put the result.
Electricity. Circuits, memory, and logic gates.
Also, I believe Python is usually interpreted, not compiled through C → assembly → machine code.

How does one use dynamic recompilation?

It came to my attention that some emulators and virtual machines use dynamic recompilation. How do they do that? In C I know how to call a function in RAM using typecasting (although I never tried), but how does one read opcodes and generate code for them? Does the person need to have premade assembly chunks and copy/batch them together? Is the assembly written in C? If so, how do you find the length of the code? How do you account for system interrupts?
-edit-
System interrupts and how to (re)compile the data are what I am most interested in. Upon more research I heard of one person (no source available) who used JS: read the machine code, output JS source and use eval to 'compile' the JS source. Interesting.
It sounds like I MUST have knowledge of the target platform's machine code to dynamically recompile
Yes, absolutely. That is why parts of the Java Virtual Machine must be rewritten (namely, the JIT) for every architecture.
When you write a virtual machine, you have a particular host-architecture in mind, and a particular guest-architecture. A portable VM is better called an emulator, since you would be emulating every instruction of the guest-architecture (guest-registers would be represented as host-variables, rather than host-registers).
When the guest- and host-architectures are the same, as with VMware, there are a ton of (pretty neat) optimizations you can do to speed up the virtualization. Today we are at the point that this type of virtual machine is BARELY slower than running directly on the processor. Of course, it is extremely architecture-dependent: you would probably be better off rewriting most of VMware from scratch than trying to port it.
It's quite possible - though obviously not trivial - to disassemble code from a memory pointer, optimize the code in some way, and then write back the optimized code - either to the original location or to a new location with a jump patched into the original location.
Of course, emulators and VMs don't have to RE-write, they can do this at load-time.
This is a wide open question, not sure where you want to go with it. Wikipedia covers the generic topic with a generic answer. The native code being emulated or virtualized is replaced with native code. The more the code is run the more is replaced.
I think you need to do a few things. First, decide if you are talking about an emulation or a virtual machine like VMware or VirtualBox. In an emulation, the processor and hardware are emulated in software, so the next instruction is read by the emulator, the opcode is pulled apart by code, and you determine what to do with it. I have been doing some 6502 emulation and static binary translation, which is dynamic recompilation but pre-processed instead of real time.
So your emulator may take an LDA #10, load A with immediate. The emulator sees the load A immediate instruction and knows it has to read the next byte, which is the immediate. The emulator has a variable in the code for the A register and puts the immediate value in that variable. Before completing the instruction the emulator needs to update the flags: in this case the Zero flag is clear, the N flag is clear, and C and V are untouched. But what if the next instruction was a load X immediate? No big deal, right? Well, the load X will also modify the Z and N flags, so the next time you execute the load A instruction you may figure out that you don't have to compute the flags because they will be destroyed; it is dead code in the emulation.
You can continue with this kind of thinking. Say you see code that copies the X register to the A register, then pushes the A register on the stack, then copies the Y register to the A register and pushes it on the stack; you could replace that chunk with simply pushing the X and Y registers on the stack. Or you may see a couple of add-with-carries chained together to perform a 16-bit add and store the result in adjacent memory locations. Basically you are looking for operations that the processor being emulated couldn't do but that are easy to do in the emulation.
Static binary translation, which I suggest you look into before dynamic recompilation, performs this analysis and translation in a static manner, as in, before you run the code. Instead of emulating, you translate the opcodes to C, for example, and remove as much dead code as you can (a nice feature is that the C compiler can remove more dead code for you).
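The LDA/LDX example above can be sketched like this. The opcode bytes (0xA9 for LDA immediate, 0xA2 for LDX immediate) and the flag bit positions are the standard 6502 ones; the class, the function names and the very naive "dead flags" check are assumptions made for the illustration, and a real translator would also have to consider branch targets and interrupts.

```python
FLAG_Z = 0x02          # zero flag, bit 1 of the 6502 status register
FLAG_N = 0x80          # negative flag, bit 7
LDA_IMM = 0xA9         # LDA #imm
LDX_IMM = 0xA2         # LDX #imm


class Cpu:
    """Emulated registers are plain host variables."""

    def __init__(self):
        self.a = 0
        self.x = 0
        self.p = 0     # status flags
        self.pc = 0


def set_zn(cpu, value):
    """Update Z and N from a loaded value; other flags are untouched."""
    cpu.p &= ~(FLAG_Z | FLAG_N) & 0xFF
    if value == 0:
        cpu.p |= FLAG_Z
    if value & 0x80:
        cpu.p |= FLAG_N


def step(cpu, mem):
    """Emulate one instruction (only the two loads are implemented)."""
    opcode = mem[cpu.pc]
    if opcode == LDA_IMM:
        cpu.a = mem[cpu.pc + 1]
        set_zn(cpu, cpu.a)
    elif opcode == LDX_IMM:
        cpu.x = mem[cpu.pc + 1]
        set_zn(cpu, cpu.x)
    else:
        raise ValueError(f"unimplemented opcode {opcode:#04x}")
    cpu.pc += 2


def flags_dead_after(mem, pc):
    """Naive check: is the load at `pc` followed by another load that
    overwrites Z and N before anything could read them?"""
    nxt = pc + 2
    return nxt < len(mem) and mem[nxt] in (LDA_IMM, LDX_IMM)


if __name__ == "__main__":
    mem = [LDA_IMM, 10, LDX_IMM, 0]
    cpu = Cpu()
    step(cpu, mem)
    step(cpu, mem)
    print(cpu.a, cpu.x, bool(cpu.p & FLAG_Z), flags_dead_after(mem, 0))
```

A translator that knows flags_dead_after is true for the LDA can emit just the register assignment and skip set_zn for it, which is the dead-code removal being described.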
Once the concepts of emulation and translation are understood, you can try to do it dynamically; it is certainly not trivial. I would suggest first trying a static translation of a binary to the machine code of the target processor, which is a good exercise. I wouldn't attempt dynamic run-time optimizations until I had succeeded in performing them statically against a/the binary.
Virtualization is a different story: you are talking about running the same processor on the same processor, so x86 on an x86, for example. The beauty here is that on non-old x86 processors you can take the program being virtualized and run the actual opcodes on the actual processor, with no emulation. You set up traps built into the processor to catch things. Loading values in AX, adding BX, etc. all happen in real time on the processor. When the program wants to read or write memory, it depends on your trap mechanism: if the addresses are within the virtual machine's RAM space, there are no traps. But let's say the program writes to an address which is the virtualized UART: you have the processor trap that, and then VMware or whatever decodes that write and emulates it talking to a real serial port. That one instruction, though, wasn't real time; it took quite a while to execute. What you could do, if you chose to, is replace the instruction or set of instructions that write a value to the virtualized serial port, and maybe have them write to a different address that could be the real serial port, or some other location that is not going to cause a fault that forces the VM manager to emulate the instruction. Or add some code in the virtual memory space that performs a write to the UART without a trap, and have the original code branch to this UART write routine instead. The next time you hit that chunk of code it runs in real time.
Another thing you can do is, for example, emulate and, as you go, translate to a virtual intermediate bytecode, like LLVM's. From there you can translate from the intermediate machine to the native machine, eventually replacing large sections of the program if not the whole thing. You still have to deal with the peripherals and I/O.
Here's an explanation of how they are doing dynamic recompilation for the 'Rubinius' Ruby interpreter:
http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/
This approach is typically used by environments with an intermediate byte code representation (like Java, .NET). The byte code contains enough "high level" structures (high level in the sense of higher level than machine code) that the VM can take chunks out of the byte code and replace them with a compiled memory block. The VM typically decides which part gets compiled by counting how many times the code has already been interpreted, since the compilation itself is a complex and time-consuming process. So it is useful to compile only the parts which get executed many times.
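The counting scheme described above can be sketched as follows. This is a toy, not how any real JVM or CLR works: the tiny stack bytecode, the threshold value and all names are assumptions, and "compiling" here means generating Python source and handing it to exec, in the same spirit as the JS eval trick mentioned in the question.

```python
HOT_THRESHOLD = 3   # hypothetical: interpret this many times, then compile


def interpret(code, x):
    """Slow path: walk the bytecode on every call."""
    stack = [x]
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(op)
    return stack.pop()


def compile_block(code):
    """Fast path: run the stack symbolically once to build an expression,
    then let the host compile it into a real function."""
    stack = ["x"]
    for op, arg in code:
        if op == "push":
            stack.append(repr(arg))
        elif op in ("add", "mul"):
            b, a = stack.pop(), stack.pop()
            sign = "+" if op == "add" else "*"
            stack.append(f"({a} {sign} {b})")
        else:
            raise ValueError(op)
    namespace = {}
    exec(f"def compiled(x):\n    return {stack.pop()}", namespace)
    return namespace["compiled"]


class Block:
    """A chunk of bytecode with an execution counter."""

    def __init__(self, code):
        self.code = code
        self.runs = 0
        self.compiled = None

    def run(self, x):
        if self.compiled is not None:
            return self.compiled(x)          # already hot: use compiled code
        self.runs += 1
        if self.runs >= HOT_THRESHOLD:
            self.compiled = compile_block(self.code)
        return interpret(self.code, x)


if __name__ == "__main__":
    block = Block([("push", 2), ("mul", None), ("push", 1), ("add", None)])
    print([block.run(5) for _ in range(5)], block.compiled is not None)
```

The first few calls pay only the interpretation cost; the compilation cost is paid once, and only for blocks that have proven to be executed repeatedly.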
but how does one read opcodes and generate code for it?
The scheme of the opcodes is defined by the specification of the VM, so the VM opens the program file, and interprets it according to the spec.
Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C?
This process is an implementation detail of the VM. Typically there is an embedded compiler which is capable of transforming the VM opcode stream into machine code.
How do you account for system interrupts?
Very simple: you don't. The code in the VM can't interact with real hardware. The VM interacts with the OS and transfers OS events to the code by jumping to/calling specific parts inside the interpreted code. Every event in the code or from the OS must pass through the VM.
Hardware virtualization products can also use some kind of JIT. A typical use case in the x86 world is the translation of 16-bit real mode code to 32- or 64-bit protected mode code, so as not to be forced to emulate a CPU in real mode. Also, a software-only VM replaces jump instructions in the executing code with jumps into the VM control software, which at each branch scans the following code path for jump instructions and replaces them before it jumps to the real code destination. But I doubt that the jump replacement qualifies as JIT compilation.
IIS does this by shadow copying: after compilation it copies assemblies to some temporary place and runs them from temp.
Imagine that a user changes some files. Then IIS will recompile the assemblies in the following steps:
Recompiles (all requests are handled by the old code)
Copies the new assemblies (all requests are handled by the old code)
All new requests will be handled by the new code, all earlier requests by the old.
I hope this'd be helpful.
A virtual machine loads "byte code" or "intermediate language" and not machine code; therefore, I suppose, it just recompiles the byte code more efficiently once it has more runtime data.
http://en.wikipedia.org/wiki/Just-in-time_compilation