
What is Dynamic Code Analysis?
How is it different from Static Code Analysis (i.e., what can it catch that can't be caught by static analysis)?
I've heard of bounds checking and memory analysis - what are these?
What other things are checked using dynamic analysis?
-Adam

Simply put, static analysis collects information from the source code, while dynamic analysis collects it from the system's execution, often using instrumentation.
Advantages of dynamic analysis
It can detect dependencies that are impossible to detect with static analysis, e.g. dynamic dependencies introduced through reflection, dependency injection, or polymorphism.
It can collect temporal information.
It deals with real input data. During static analysis it is difficult or impossible to know what files will be passed as input, what web requests will arrive, what the user will click, etc.
Disadvantages of dynamic analysis
It may negatively impact the performance of the application.
It cannot guarantee full coverage of the source code, since its runs depend on user interaction or automated tests (both points are illustrated in the sketch below).
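To make the instrumentation idea concrete, here is a minimal hand-written sketch in C; real dynamic analysis tools inject equivalent probes automatically at compile time or by binary rewriting. All names are invented for illustration:

    #include <stdio.h>

    /* Hypothetical hand-rolled instrumentation: count how often each
     * probed function is entered. The probes themselves are the source
     * of the runtime overhead mentioned above. */
    static unsigned long add_calls, mul_calls;

    static int add(int a, int b) { add_calls++; return a + b; }  /* probe at entry */
    static int mul(int a, int b) { mul_calls++; return a * b; }  /* probe at entry */

    int main(int argc, char **argv) {
        /* Which function runs depends on the input -- so the data the
         * analysis collects (and the code it covers) does too. */
        int r = (argc > 1) ? mul(3, 4) : add(2, 3);
        printf("result=%d add_calls=%lu mul_calls=%lu\n", r, add_calls, mul_calls);
        return 0;
    }

Run without arguments, mul is never executed and the analysis learns nothing about it, which is exactly the coverage limitation listed above.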
Resources
There are many dynamic analysis tools on the market, debuggers being the most well known. On the other hand, it is still an active academic research field: many researchers are studying how to use dynamic analysis to better understand software systems, and there is an annual workshop dedicated to dependency analysis.

Basically you instrument your code to analyze your software as it is running (dynamic) rather than just analyzing the software without running it (static). Also see this JavaOne presentation comparing the two. Valgrind is one example of a dynamic analysis tool for C. You could also use code coverage tools like Cobertura or EMMA for Java analysis.
From Wikipedia's definition of dynamic program analysis:
Dynamic program analysis is the analysis of computer software that is performed with executing programs built from that software on a real or virtual processor (analysis performed without executing programs is known as static code analysis). Dynamic program analysis tools may require loading of special libraries or even recompilation of program code.

You asked for a good explanation of "bounds checking and memory analysis" issues.
Our Memory Safety Check tool instruments your application to watch at runtime for memory access errors (buffer overruns, array subscript errors, bad pointers, alloc/free errors). The link contains a detailed explanation complete with examples. This SO answer shows two programs that have pointers into a dead stack frame, and how CheckPointer detects and reports the point of error in the source code.
A briefer example: C (and C++) infamously do not check accesses to arrays to see if the access is inside the bounds of the array. The benefit: well-designed programs don't pay the cost of such a check in production mode. The downside: buggy programs can touch things outside the array, and this can cause behavior which is very hard to understand; thus the buggy program is difficult to debug.
What a dynamic instrumentation tool like the Memory Safety Checker does is associate some metadata with every pointer (e.g., the type of the thing to which the pointer "points", and, if it is an array, the array bounds), and then check at runtime any accesses via pointers to arrays to see whether the array bound is violated. The tool modifies the original program to collect the metadata where it is generated (e.g., on entry to scopes in which arrays are declared, or as the result of a malloc operation, etc.) and modifies the program at every array reference (written as x[y], where either x or y is an array pointer and the other value is of some integral type; similarly for *(x+y)!) to check the access. Now if the program runs and performs an out-of-bounds access, the check catches the error, and it is reported at the first place where it could be detected. (If you think about it, you'll realize the instrumentation for metadata collection and checking has to be pretty clever to handle all the variant cases a language like C may have. It's actually hard to make this work completely.)
The good news is that such access errors are now reported early, where it is easier to detect the problem and fix the program. Such a tool isn't intended for production use; one uses it during development and testing to help verify the absence of errors. If no errors are discovered, one does a normal compile and runs the programs without the checks.
This is an extremely good example of a dynamic analysis tool: the testing happens at runtime.
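To make the metadata scheme above concrete, here is a toy hand-written version in plain C: a "fat pointer" that carries its bounds, plus a checked access helper standing in for the instrumented form of x[y]. The real tool rewrites the program to do this automatically; every name below is invented for illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy "fat pointer": the raw pointer plus the metadata the tool
     * would associate with it (here, just the element count). */
    typedef struct {
        int    *base;
        size_t  length;
    } checked_int_array;

    /* What the instrumented form of x[y] (or *(x+y)) conceptually
     * becomes: validate the index against the metadata, then access. */
    static int checked_read(checked_int_array a, size_t i,
                            const char *file, int line) {
        if (i >= a.length) {
            fprintf(stderr, "%s:%d: out-of-bounds read (index %zu, length %zu)\n",
                    file, line, i, a.length);
            abort();  /* report at the first place the error is detectable */
        }
        return a.base[i];
    }

    int main(void) {
        int data[4] = {1, 2, 3, 4};
        checked_int_array a = { data, 4 };  /* metadata captured at declaration */

        printf("%d\n", checked_read(a, 3, __FILE__, __LINE__));  /* in bounds */
        printf("%d\n", checked_read(a, 4, __FILE__, __LINE__));  /* caught here */
        return 0;
    }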

Bounds checking
This means runtime checking of array accesses. Contrary to C's laissez-faire approach to memory accesses and pointer arithmetic, other languages such as Java or C# actually check, at runtime, whether a given array contains the element one is trying to access, and raise an error if it does not.
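For contrast, a minimal C example of the behavior those checks rule out; the out-of-bounds write below compiles cleanly and may even appear to work:

    #include <stdio.h>

    int main(void) {
        int a[4] = {0, 0, 0, 0};
        a[4] = 42;             /* one past the end: undefined behavior in C,
                                  and no compile-time or runtime error is
                                  required to be reported */
        printf("%d\n", a[0]);  /* the equivalent access in Java or C# would
                                  fail immediately with a bounds error */
        return 0;
    }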

Related

Analyzing the speed-up of Oracle's HotSpot versus other compilation techniques

I'm currently working on a project that must involve research of JIT techniques. I'm a complete beginner when it comes to anything related to compilers, but I did some research and learned about Java's HotSpot VM. I was hoping to do an analysis on the benefits (or downsides) of using HotSpot versus traditional compilers (for example, g++).
My initial idea was to create some sort of simple program that can be run through both compilers in order to compare compilation times but this brought up a number of questions:
From my understanding, Java source code is initially turned into bytecode by the javac compiler (creating .class files) and then, in turn, this bytecode can be run through HotSpot at runtime to execute the program. Given this, would it even be relevant to compare results with a traditional compiler that converts sources directly to machine code?
Another concern I'm facing is that the programs would be in different languages (ex. C++ vs Java). Although the functionality would be identical, could this skew results when attempting to compare?
Moving on, if the above two points are not a problem, my main question is:
How can I actually go about benchmarking the speed-up in one method versus the other?
I did some brief research about this but all I was able to find were ways to measure the efficiency of the program itself, not the compilation technique used to run it. Is what I'm trying to do possible? Are there methods to actually analyze the speed up of one compiler over another?
Any help is appreciated!
How can I actually go about benchmarking the speed-up in one method versus the other?
You first need to consider what you actually intend to measure. In other words, saying "the speed-up" is not sufficiently rigorous.
Are we talking about CPU cycles spent compiling? Or walltime from source code to running program? Or peak performance of a few critical methods in a micro benchmark? Overall steady-state program performance? Speed of program initialization? ...
In the end you're comparing two systems that made quite different trade-offs. You can find a few roughly comparable benchmarks already mentioned in the comments, but in the end they mostly represent specific types of throughput-bound tasks, not large applications. It's not as if you can find an application such as Firefox written in both C and Java with identical feature sets and comparable code quality. So any comparison you do will be incomplete, because you'll have to use some limited proxy measurement of how comparable the two code bases are.
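If you do settle on, say, steady-state performance as the metric, the measurement discipline matters more than the languages involved: run untimed warm-up iterations first (for a JIT this is when most compilation happens), then time only the later ones. A rough sketch of such a harness in C; the workload and iteration counts are arbitrary placeholders:

    #include <stdio.h>
    #include <time.h>

    /* Placeholder workload -- substitute whatever you are benchmarking. */
    static volatile long sink;
    static void workload(void) {
        long s = 0;
        for (long i = 0; i < 1000000; i++) s += i * i;
        sink = s;  /* volatile sink keeps the compiler from deleting the loop */
    }

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        /* Untimed warm-up: in a JIT-based system this is where compilation
         * would happen; timing it would mix compile time into the result. */
        for (int i = 0; i < 10; i++) workload();

        /* Timed steady-state iterations. */
        double t0 = now_sec();
        for (int i = 0; i < 50; i++) workload();
        double t1 = now_sec();

        printf("steady-state: %.3f ms/iteration\n", (t1 - t0) * 1000.0 / 50);
        return 0;
    }

Measuring "walltime from source code to running program" instead would mean timing the compile step as well, which is a different experiment; be explicit about which one you are reporting.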

Test-Automation using MetaProgramming

I want to learn test automation using metaprogramming. I googled it but could not find anything. Can anybody suggest some resources where I can get info about how to use metaprogramming to make test automation easy?
That's a broad topic and not a lot has been written about it, because of the "dark corners" of metaprogramming.
What do you mean by "metaprogramming"?
As background, I consider metaprogramming to be any activity in which a tool (which we call a "metaprogramming tool") is used to inspect or modify the application software to achieve some effect.
Many people consider "reflection" to be a kind of metaprogramming; others consider (C++-style) templates to be metaprogramming; some suggest aspect-oriented programming.
I sort of agree, but think these are weak versions of what you want, because each has severe limits on what it can see or do to source code. What you really want is a metaprogramming tool that has access to everything in your source program (yes, comments too!). Such tools are called Program Transformation Systems (PTS); they work by parsing the source code and operating on the parsed representation of the program. (I happen to build one of these; see my bio.) A PTS can then analyze the code accurately, and/or make reliable changes to the code and regenerate valid source with the changes. PS: a PTS can implement all those other metaprogramming techniques as special cases, so it is strictly more general.
Where can you use metaprogramming for testing?
There are at least three areas in which metaprogramming might play a role:
1) Collection of information from tests
2) Generation of tests
3) Avoidance of tests
Collection.
Collection of test results depends on the nature of tests. Many tests are focused on "is this white/black box functioning correctly?" Assuming the tests are written somehow, they have to have access to the box under test,
be able to invoke that box in realistic ways, determine if the result is correct, and often tabulate the results so that post-testing quality assessments can be made.
Access is the first problem. The black box to be tested may not be easily accessible to a testing framework: driven by a UI event, in a non-public routine, buried deep inside another function where it is hard to get at.
You may need metaprogramming to "temporarily" modify the program to provide access to the box that needs testing (e.g., change a Private method to Public so it can be called from outside). Such changes exist only for the duration of the test project; you throw the modified program away because nobody wants it for anything but the test results. Yes, you have to ensure that the code transformations applied to make things visible don't change the program functionality.
The second problem is exercising the targeted black box in a realistic environment. Each code module runs in a world in which it assumes data and the environment are "properly" configured. The test program can set up that world explicitly by making calls on lots of the program elements or using its own custom code; this is usually the bulk of a test routine, and this code is hard to write and fragile (the application under test keeps changing; so do its assumptions about the world). One might use metaprogramming to instrument the application to collect the environment under which a test might need to run, thus avoiding the problem of writing all the setup code.
Finally, one might want to record more than just "test failed/passed". Often it is useful to know exactly what code got tested ("test coverage"). One can instrument the application to collect what-got-executed data; here's how to do it for code blocks: http://www.semdesigns.com/Company/Publications/TestCoverage.pdf using a PTS. More sophisticated instrumentation might be used to capture information about which paths through the code have been executed. Uncovered code, and/or uncovered paths, show where tests have not been applied and you arguably know nothing about what the program does, let alone whether it is buggy in a straightforward way.
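As a concrete, hand-written illustration of the block-coverage instrumentation just described; a PTS would number the blocks and insert the probes mechanically, and all names here are invented:

    #include <stdio.h>

    /* One flag per instrumented basic block. */
    static int visited[3];
    #define PROBE(n) (visited[n] = 1)

    static int classify(int x) {
        PROBE(0);              /* block 0: function entry */
        if (x < 0) {
            PROBE(1);          /* block 1: negative branch */
            return -1;
        }
        PROBE(2);              /* block 2: non-negative branch */
        return 1;
    }

    int main(void) {
        classify(5);           /* this "test" exercises blocks 0 and 2 only */
        for (int i = 0; i < 3; i++)
            printf("block %d: %s\n", i, visited[i] ? "covered" : "NOT covered");
        return 0;
    }

The report immediately shows that the negative branch was never tested.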
Generation of tests
Someone/thing has to produce tests; we've already discussed how to produce the set-up-the-environment part. What about the functional part?
Under the assumption that the program has been debugged (e.g., already tested by hand and fixed), one could use metaprogramming to instrument the code to capture the results of execution of a black box (e.g., instance execution post-conditions). By exercising the program, one can then record results that are (by definition) "correctly produced", and these can be transformed into a test. In this way, one might construct a huge variety of regression tests for an existing program; these will be valuable in verifying that further enhancements to the program don't break most of its functionality.
Often a function has qualitatively different behaviors on different ranges of input (e.g., for x<10, it produces x+1, else it produces x*x). Ideally one would like to provide a test for each qualitatively different result (e.g., x<10, x>=10), which means one would like to partition the input ranges. Metaprogramming can help here, too, by enumerating all (partial) paths through a module, and providing the predicate that controls each path.
The separate predicates each represent an input space partition of interest.
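Using that x<10 example, a sketch of what tests derived from the path predicates might look like: one test per input partition. The expected values assume the function was already validated by hand, as described above:

    #include <assert.h>

    /* The unit under test, with two qualitatively different behaviors. */
    static int f(int x) {
        if (x < 10) return x + 1;  /* path predicate: x < 10  */
        else        return x * x;  /* path predicate: x >= 10 */
    }

    int main(void) {
        /* One generated test per partition of the input space. */
        assert(f(5)  == 6);    /* representative of the x < 10 partition  */
        assert(f(10) == 100);  /* representative of the x >= 10 partition */
        return 0;
    }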
Avoidance of Tests
One only tests code one does not trust (surely you aren't testing the JDK?). Any code constructed by a reliable method doesn't need tests (the JDK was constructed this way, or at least Oracle is happy to have you believe it).
Metaprogramming can be used to automatically generate code from specifications or DSLs, in reliable ways. Such generated code is correct-by-construction (we can argue about what degree of rigour), and doesn't need tests. You might need to test that the DSL expression achieves the functionality you desired, but you don't have to worry about whether the generated code is right.

static and dynamic code analysis

I found several questions about this topic, all of them with a lot of references, but I still don't have a clear idea about it, because most of the references speak about concrete tools and not about the general concept of the analysis. Thus I have some questions:
About Static analysis:
1. I would like a reference, or a summary, of which techniques are successful and most relevant nowadays.
2. What can they really do to discover bugs? Can we make a summary, or does it depend on the tool?
About symbolic execution:
1. Where should symbolic execution be classified? I guess it depends on the approach,
but I would like to know whether it is dynamic analysis, or a mix of static and dynamic analysis, if that is possible to determine.
I have had trouble differentiating the two techniques in the tools, even though I think I know the theoretical difference.
I'm currently working with C.
Thanks in advance
I'm trying to give a short answer:
Static analysis looks at the syntactical structure of the code and draws conclusions about the program's behavior. These conclusions are not necessarily correct.
A typical example of static analysis is data flow analysis, where you compute sets like used, read, write for every statement. This will help to find e.g. uninitialized values.
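For instance, here is the kind of defect such an analysis flags without ever running the program; "total" below can be read before any write reaches it:

    #include <stdio.h>

    int main(void) {
        int total;                 /* declared but never initialized */
        int values[3] = {1, 2, 3};

        for (int i = 0; i < 3; i++)
            total += values[i];    /* read of 'total' before any write:
                                      data-flow analysis reports this from
                                      the code structure alone */
        printf("%d\n", total);
        return 0;
    }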
You can also analyze the code regarding code-patterns. This way, these tools can be used to check if you are complying to a specific coding standard. A prominent coding standard example is MISRA. This coding standard is used for safety critical systems and avoids problematic constructs in C. This way you can already say a lot about the robustness of your applications against memory leaks, dangling pointers, etc.
Dynamic analysis is not looking at the syntax only, but takes state information into account. In symbolic execution, you are adding assumptions about the possible values of all variables to the statements.
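A small example of what "adding assumptions" means: each branch taken contributes a constraint on the symbolic inputs, and a path is feasible only if its accumulated constraints are satisfiable. The comments sketch the constraints; no particular tool's notation is implied:

    /* Symbolic execution treats a and b as symbols, not concrete values. */
    int example(int a, int b) {
        int x = 0;
        if (a > 5) {         /* path constraint: a > 5            */
            x = a - b;
            if (x == 0)      /* adds: a - b == 0, i.e. a == b     */
                return 1;    /* reachable iff (a > 5 && a == b);
                                a solver yields a witness such as
                                a = 6, b = 6                      */
        }
        return 0;            /* all other paths end here          */
    }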
The most expensive and powerful method of dynamic analysis is model checking, where you really look at all possible execution states of the system. You can think of a model-checked system as a system that is tested with 100% coverage - but there are of course a lot of practical problems that prevent real systems from being checked that way.
These methods are very powerful, and you can gain a lot from the static code analysis tools especially when combined with a good coding standard.
A feature my software team found really impressive is e.g. that it will tell you in C++ when a class with virtual methods does not have a virtual destructor. Easy to check in fact, but really helpful.
The commercial tools are very expensive, but worth the money, once you learned how to use them. A typical problem in the beginning is that you will get a lot of false alarms, and don't know where to look for the real problem.
Note that nowadays g++ has some of this stuff already built-in, and that you can use something like pclint which is free.
Sorry - this is already getting quite long...hope it's interesting.
The term "static analysis" means that the analysis does not actually run a code. On the other hand, "dynamic analysis" runs a code and also requires some kinds of real test inputs. That is the definition. Nothing more.
Static analysis employs various formal methods such as abstract interpretation, model checking, and symbolic execution. In general, abstract interpretation or model checking is suitable for software verification. Symbolic execution is more appropriate for the purpose of bug finding.
Symbolic execution is categorized as static analysis. However, there is a hybrid method called concolic execution which uses both symbolic execution and dynamic testing.
Added for Zane's comment:
Maybe my explanation was a little confusing.
The difference between software verification and bug finding is whether the analysis is sound or not. For example, when we say the buffer overrun analyzer is sound, it means that the analyzer must report all possible buffer overruns. If the analyzer reports nothing, it proves the absence of buffer overruns in the target program. Because model checking is the method that guarantees soundness, it is mostly used for software verification.
On the other hand, symbolic execution, which is actively used by most of today's commercial static analyzers, does not guarantee soundness, since a sound analysis inherently issues lots and lots of false positives. For the purpose of bug finding, it is more important to reduce false positives, even if some true positives are also lost.
In summary,
soundness: there are no false negatives
completeness: there are no false positives
software verification: soundness is more important than completeness
bug finding: completeness is more important than soundness

What are motivations behind compiling to byte-code?

I'm working on my own toy programming language. For now I'm interpreting the source language from the AST, and I'm wondering what advantages compiling to bytecode and then interpreting it could provide.
For now I have three things in mind:
Traversing the syntax tree hundreds of times may be slower than running instructions in an array, especially if the array supports O(1) random access (i.e., jumping 10 instructions up or down).
In a typed execution environment, I have some run-time costs because my AST is typed and I'm constantly traversing it (i.e., I have 10 types of nodes and I need to check which type I'm on in order to execute). Maybe compiling to an untyped bytecode could help improve this, since after type-checking and compiling, I would have untyped values and code.
Compiling to byte-code may provide better portability.
Are my points correct? What are some other motivations behind compiling to bytecode?
Speed is the main reason; interpreting ASTs is just too slow in practice.
Another reason to use bytecode is that it can be trivially serialized (stored on disk), so that you can distribute it. This is what Java does.
The point of generating byte code (or any other "easily interpreted" form such as threaded code) is essentially performance.
For an AST interpreter to decide what to do next, it needs to traverse the tree, inspect nodes, determine the type of nodes, check the type of any operands, verify legality, and decide which special case of the AST-designated operator applies (it says "+", but does it mean 16-bit add or string concatenation?), before it finally performs some action.
If one takes the final action and generates some kind of easily interpreted structure, then at "execution" time the interpreter can focus simply on performing actions without all that checking/special-case determination.
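A minimal sketch of the contrast in C; the "compiled" form is a flat array of pre-decoded instructions, and the interpreter's inner loop is a plain switch over opcodes, with none of the per-node inspection described above. The instruction set is invented for illustration:

    #include <stdio.h>

    /* A tiny invented stack-machine instruction set. */
    typedef enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT } opcode;
    typedef struct { opcode op; int arg; } insn;

    static void run(const insn *code) {
        int stack[64], sp = 0;
        for (int pc = 0; ; pc++) {     /* flat array: O(1) jumps        */
            switch (code[pc].op) {     /* one dispatch per instruction, */
            case OP_PUSH: stack[sp++] = code[pc].arg; break;  /* no tree walk */
            case OP_ADD:  sp--; stack[sp-1] += stack[sp]; break;
            case OP_MUL:  sp--; stack[sp-1] *= stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp-1]); break;
            case OP_HALT: return;
            }
        }
    }

    int main(void) {
        /* "Compiled" form of the expression (2 + 3) * 4. */
        insn program[] = {
            { OP_PUSH, 2 }, { OP_PUSH, 3 }, { OP_ADD, 0 },
            { OP_PUSH, 4 }, { OP_MUL, 0 }, { OP_PRINT, 0 }, { OP_HALT, 0 }
        };
        run(program);  /* prints 20 */
        return 0;
    }

All the type checking and operator resolution happened once, at compile time; the loop above only performs actions.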
Another recent excuse is that if you generate byte code for any of a number of well-known virtual machines (JVM, MSIL, Parrot, etc.) you don't even have to code the interpreter. For the JVM and MSIL, you also get the benefit of the JIT compilers associated with them, and with careful design of your language, compatibility with huge libraries, which are the real attraction of Java and C#.

How do you organize code in embedded projects?

Highly embedded (limited code and ram size) projects pose unique challenges for code organization.
I have seen quite a few projects with no organization at all. (Mostly by hardware engineers who, in my experience are not typically concerned with non-functional aspects of code.)
However, I have been trying to organize my code accordingly:
hardware specific (drivers, initialization)
application specific (not likely to be reused)
reusable, hardware independent
For each module I try to keep the purpose to one of these three types.
Due to the limited size of embedded projects and the emphasis on performance, it is often difficult to keep this organization.
For some context, my current project is a limited DSP application on an MSP430 with 8k flash and 256 bytes of RAM.
I've written and maintained multiple embedded products (30+ and counting) on a variety of target micros, including MSP430's. The "rules of thumb" I have been most successful with are:
Try to modularize generic concepts as much as possible (e.g. separate driver code from application code). -- It makes for easier maintenance and reuse/porting of a project to another target micro in the future.
DO NOT start by worrying about optimized code at the very beginning. Try to solve the domain's problem first and optimize second. -- Your target micro can handle a lot more "stuff" than you might expect.
Work to ensure readability. Although most embedded projects seem to have short development-cycles, the projects often live longer than you might expect and another developer will undoubtedly have to work with your code.
I've worked on 8-bit PIC processors with similar limitations.
One restriction you don't have is how many comments you make or what you choose to name your methods, variables, etc. Take advantage. Speed and size constraints do sometimes trump organization, but you can always explain.
Another tip is to break up a logical source file into even more pieces than you need, then bind them back together by #include-ing them in a compilation unit. This allows you to have lots of reusable code (even one routine per file) but combine it in whatever order you need. This is useful, e.g., when trying to meet compilation unit size restrictions, or to pick and choose which common subroutines you need on the next project.
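A sketch of that pattern with invented file names: each routine lives in its own file, and a thin aggregator source file is the only thing the build actually compiles:

    /* util_swap.c -- one reusable routine per file */
    void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    /* util_clamp.c -- another single-routine file */
    int clamp_int(int x, int lo, int hi) { return x < lo ? lo : x > hi ? hi : x; }

    /* project_utils.c -- the compilation unit this project builds;
     * pick and choose exactly the routines this project needs. */
    #include "util_swap.c"
    #include "util_clamp.c"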
I try to organize it as if I had unlimited RAM and ROM, and it usually works out fine. As mentioned elsewhere, do not try to optimize it until you absolutely need to.
If you can get a pin-compatible processor that has more resources, it's better to get it working on that, concentrating on good structure and layout, then optimize for size later when you understand the code better.
Except under exceptional circumstances (see note below), the organisation of your code will have no impact on the final product. (The contents of the code are obviously a different matter.)
So with that in mind you should organise your code as you would any other project.
With that said, the following are fairly typical:
If this is a processor that you've worked on before, or will be working on in the future, you will usually want to keep a dedicated hardware abstraction layer (HAL) that can be shared between projects in the future. Typically this module would contain items like routines for managing any UARTs, timers, etc.
Usually it's reasonable to maintain a set of platform-specific code for initialisation and setup that performs all of the configuration and initialisation up to the point where your executive takes over and runs your application. It will also include platform-specific HAL routines.
The executive/application is probably maintained as a separate module. All of the hardware-specific code should be hidden in the HAL (as mentioned above).
By splitting your code up like this, you also have the option of compiling and running your application as a simulation, on a completely different platform, just by replacing the hardware-specific code with routines that mimic the hardware.
This can be good for unit testing and for debugging algorithmic problems you might have.
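A sketch of that split with invented names: the application codes against a small HAL interface, and the build links either the real implementation or a simulated one:

    /* hal.h -- the hardware abstraction the application codes against */
    #ifndef HAL_H
    #define HAL_H
    void hal_led_set(int on);
    unsigned hal_adc_read(void);
    #endif

    /* hal_target.c (sketch) -- the real build: these routines would
     * touch the device registers of the actual micro. */

    /* hal_sim.c -- the simulation build, compiled on a desktop PC
     * for unit testing; it mimics the hardware. */
    #include <stdio.h>
    #include "hal.h"
    void hal_led_set(int on)    { printf("LED -> %s\n", on ? "on" : "off"); }
    unsigned hal_adc_read(void) { return 512; /* canned test stimulus */ }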
Exceptional circumstances might be imposed by unusual compiler restrictions, e.g. I've come across some compilers that expect all interrupt service routines to be compiled within a single object file.
I've worked with some sensors like the Tmote Sky. I too have seen poor organization, and I have to admit I have contributed to it. Anyway, I'd say that some compromise is unavoidable, because loading too many modules or splitting the program into too many parts will (IMHO) also kill resources, so try to be aware of the threshold between organization and usability given the low resources.
Obviously this doesn't mean letting chaos take over; but, for example, take a look at the organization of the TinyOS source code and applications - it gives an idea of what I'm trying to say.
Although it is a bit painful, one organization technique that is somewhat common with embedded C libraries is to split every single function and variable into a separate C source file, and then aggregate the resulting collection of object files into a library file.
The motivation for doing this is that for most normal linkers the unit of linkage is an object file: for every object, you either get the whole object or none of it. Since there is a 1-1 relationship between C files and object files, putting each symbol in its own C file gives each one its own object. This in turn lets the linker pull in only the subset of functions and variables that are actually used.
This sort of game doesn't help at all for headers; they can happily be left as single files.
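A sketch of the layout and its effect, with invented names: each function gets its own C file, hence its own object in the archive, and the linker pulls in only the objects that resolve symbols actually referenced:

    /* crc8.c -- becomes crc8.o inside libutil.a */
    unsigned char crc8_xor(const unsigned char *p, int n) {
        unsigned char c = 0;
        while (n--) c ^= *p++;  /* toy checksum; one function per file */
        return c;
    }

    /* reverse.c -- becomes reverse.o inside libutil.a */
    void reverse_bytes(unsigned char *p, int n) {
        for (int i = 0, j = n - 1; i < j; i++, j--) {
            unsigned char t = p[i]; p[i] = p[j]; p[j] = t;
        }
    }

    /* Build, conceptually:
     *   cc -c crc8.c reverse.c
     *   ar rcs libutil.a crc8.o reverse.o
     * An application that calls only crc8_xor() links in only crc8.o;
     * reverse.o never enters the final image. */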