Calculate cyclomatic complexity at run time for generated program tree - code-analysis

I am running an evolutionary algorithm that automatically generates S-expressions that represent an abstract syntax tree. From there I generate C code to create a compilable program.
For each generated expression I need to calculate the cyclomatic complexity to be used in the fitness calculation. I have noticed that there are tools available to do so (such as the metrics Eclipse plugin), but I was hoping for something that could analyze a more generic program representation.
I could see calling an external tool, however I think that would significantly increase my execution time. Is there a simple way to calculate cyclomatic complexity via some sort of formula that takes into account S-expressions or abstract syntax trees?

If your generated programs are goto-free (i.e., "structured" programs), you can compute cyclomatic complexity with a simple rule:
CC(p) = #conditionals + 1
This holds for structured programs; you only need the full graph-based definition if your programs are a spaghetti control-flow tangle.
Note that the conditional count should include while and for loops (since they contain conditions) as well as the short-circuit branching operators && and ||.
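Since the expressions are already trees, that count can be done with a single traversal. Here is a minimal sketch, assuming the S-expressions are nested Python lists and using hypothetical operator names ('if', 'while', 'and', 'or'); substitute whatever node tags your generator actually emits:

```python
# Sketch: cyclomatic complexity of an S-expression held as nested lists,
# e.g. ['if', cond, then_branch, else_branch]. Operator names are assumptions.
DECISION_OPS = {'if', 'while', 'for', 'and', 'or'}   # &&/|| analogues included

def cyclomatic_complexity(expr):
    """CC = #decision points + 1, valid for goto-free (structured) programs."""
    def count(node):
        if not isinstance(node, (list, tuple)) or not node:
            return 0
        head, *args = node
        return (1 if head in DECISION_OPS else 0) + sum(count(a) for a in args)
    return count(expr) + 1

# if (x > 0 && y > 0) { ... } else { ... }  ->  1 'if' + 1 'and' + 1 = 3
tree = ['if', ['and', ['>', 'x', 0], ['>', 'y', 0]], ['block'], ['block']]
print(cyclomatic_complexity(tree))   # 3
```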

Related

Algebraic/implicit loops handling by Gekko

I have got a specific question with regards to algebraic / implicit loops handling by Gekko.
I will give examples in the field of Chemical Engineering, as this is how I found the project and its other libraries.
For example, when it comes to multicomponent chemical equilibrium calculations, it is not possible to work out the equations explicitly, because the concentration of one species may appear in many different equations.
I have been using other paid software in the past and it would automatically propose a resolution procedure based on how the system is solvable (by analyzing dependency and creating automatic algebraic loops).
My question would be:
Does Gekko do that automatically?
It is a little bit tricky because sometimes one needs to add tear variables and iterate from a good starting value.
I know this message may be a little abstract, but I am trying to decide which software to use for my work, and this is a practical bottleneck that I have run into.
Thanks in advance for your valuable insight.
Python Gekko uses a simultaneous solution strategy so that all units are solved together instead of sequentially. Therefore, tear variables are not needed, but large flowsheet problems with recycle can be difficult to converge to a feasible solution. Below are three methods available in Python Gekko to assist with efficient solutions and initialization.
Method 1: Intermediate Variables
Intermediate variables are useful for decreasing the complexity of the model. In many models, the temporary variables outnumber the regular variables. This model reduction often aids the solver in finding a solution by reducing the problem size. Intermediate variables are declared with m.Intermediate() in Python Gekko. They may be defined in one section or in multiple declarations throughout the model. Intermediate variables are parsed sequentially, from top to bottom. To avoid inadvertent overwrites, each intermediate variable can be defined only once. For intermediate variables, the order of declaration is critical: if an intermediate is used before its definition, an error reports that there is an uninitialized value. Here is additional information on Intermediates with an example problem.
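As a minimal illustration of the idea (variable names and values here are made up, not from the question):

```python
from gekko import GEKKO

m = GEKKO(remote=False)
x = m.Var(value=1.0)
y = m.Var(value=1.0)

# An intermediate is defined once and substituted into later equations,
# which reduces the number of variables the solver has to handle.
frac = m.Intermediate(x / (x + y))

m.Equation(frac * y == 0.2)
m.Equation(x + y == 3)
m.solve(disp=False)
print(x.value[0], y.value[0])
```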
Method 2: Lower Block Triangular Decomposition
For large problems that have trouble with initialization, there is a mode that is activated with the option m.options.COLDSTART=2. This mode performs a lower block triangular decomposition to automatically identify independent blocks that are then solved independently and sequentially.
This decomposition method for initialization is discussed in the PhD dissertation (chapter 2) of Mostafa Safdarnejad or also in Safdarnejad, S.M., Hedengren, J.D., Lewis, N.R., Haseltine, E., Initialization Strategies for Optimization of Dynamic Systems, Computers and Chemical Engineering, 2015, Vol. 78, pp. 39-50, DOI: 10.1016/j.compchemeng.2015.04.016.
Method 3: Automatic Model Reduction
Model reduction requires more pre-processing time but can help to significantly reduce the solver time. There is additional documentation on m.options.REDUCE.
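Both of these modes are turned on through solver options; a sketch (the option values follow the descriptions above, and REDUCE is assumed to be the number of pre-processing passes):

```python
from gekko import GEKKO

m = GEKKO(remote=False)
# ... build the flowsheet model here ...

m.options.COLDSTART = 2   # Method 2: lower block triangular initialization
m.options.REDUCE = 3      # Method 3: pre-processing model reduction passes

m.solve(disp=True)
```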
Overall Strategy for Initialization
The overall strategy that we use for initializing hard problems, such as flowsheets with recycle, is shown in this flowchart.
Sometimes it does mean breaking recycles to get an initialized solution. Other times, the initialization strategies detailed above work well and no model rearrangement is necessary. One advantage of working with a simultaneous solution strategy is degree-of-freedom swapping: downstream variables can be fixed and upstream variables calculated to meet those values.

How to translate computation in index notation into sequence of SIMD ops in general case?

UPD: the question in its original form is poorly formulated because I badly confuse terminology (SIMD vs. vectorized computations) and give too broad an example that does not specify exactly what the problem is; I voted to close it as "unclear what you're asking", and I'll link a better-formulated question above whenever it appears.
In mathematics, one would usually describe n-dimensional tensor computation using index notation, that would look something like:
A[i,j,k] = B[k,j] + C[d[k],i,B[k,j]] + d[k]*f[j] // for 0<i<N, 0<j<M, 0<k<K
but if we want to use any SIMD library to efficiently parallelize that computation (and take advantage of linear-algebraic magic), we would have to express it using primitives from BLAS, numpy, tensorflow, OpenCL, ... that is often quite tricky.
Expressions in Einstein notation like A_ijk*B_kj are generally handled via np.einsum (using tensordot, sum and transpose internally, I guess?). Summation and other element-wise ops are also okay; "smart" indexing is quite tricky, though (especially if an index appears more than a single time in the expression).
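For instance, for the contraction A_ijk*B_kj a single einsum call does the job (a small illustration with made-up shapes, not code from the original post):

```python
import numpy as np

N, M, K = 4, 5, 6
A = np.random.rand(N, M, K)
B = np.random.rand(K, M)

# repeated indices j and k are summed over, leaving i
r = np.einsum('ijk,kj->i', A, B)

# explicit-loop equivalent, for comparison
r_loop = np.array([sum(A[i, j, k] * B[k, j]
                       for j in range(M) for k in range(K))
                   for i in range(N)])
assert np.allclose(r, r_loop)
```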
I wonder if there are any language-agnostic libraries that take an expression in a certain form (let's say, the form above) and translate it into some Intermediate Representation that can be efficiently executed using existing linear-algebra libraries?
There are libraries that attempt to parallelize loop computations (the user API usually looks like #pragma in C++ or @numba.jit in Python), but I'm asking about a slightly different thing: translating an arbitrary expression in the form above into a finite sequence of SIMD commands, like elementwise-ops, matvecs, tensordots and etc.
If there are no language-agnostic solutions yet, I am personally interested in numpy computations :)
Further questions about the code:
I see B[k,j] is used as an index and as a value. Is everything integer? If not, which parts are FP, and where does the conversion happen?
Why does i not appear on the right-hand side? Is the same data repeated N times?
Oh yikes, so you have a gather operation, with indices coming from d[k] and B[k,j]. Only a few SIMD instruction sets support this (e.g. AVX2).
I mostly manually vectorize stuff in C with Intel's x86 intrinsics (or use auto-vectorization and check the compiler's asm output to make sure it didn't suck), so IDK if there's any kind of platform-independent way to express that operation.
I wouldn't expect that many cross-platform SIMD languages would provide a gather or anything built on top of a gather. I haven't used numpy though.
I don't expect you'd find a BLAS, LAPACK, or other library function that includes a gather, unless you go looking for implementations of this exact problem.
With an efficient gather (e.g. Intel Skylake or Xeon Phi), it might vectorize ok if you use SIMD in the loop over j, so you load a whole vector at once from B[], and from f[], and use it with a vector holding d[k] broadcast to every position. You probably want to store a transposed result matrix, like A[i][k][j], so the final store doesn't have to be a scatter. You definitely need to avoid looping over k in the inner-most loop, since that makes loads from B[] non-contiguous, and you have d[k] instead of f[j] varying inside the inner loop.
I haven't done much with GPGPU, but they do SIMD differently. Instead of short vectors like CPUs use, they have effectively many scalar processors ganged together. OpenCL or CUDA or whatever other hot new GPGPU tech might handle your gathers much more efficiently.
SIMD commands, like elementwise-ops, matvecs, tensordots and etc.
When I think of "SIMD commands", I think of x86 assembly instructions (or ARM NEON, or whatever), or at least C / C++ intrinsics that compile to single instructions. :P
A matrix-vector product is not a single "instruction". If you used that terminology, every function that processes a buffer would be "a SIMD instruction".
The last part of your question seems to be asking for a programming-language independent version of numpy, for gluing together high-performance library functions. Or were you thinking that there might be something that would inter-optimize such operations, so you could write something that would compile to a vectorized loop that did stuff like use each input more than once without having to reload it in separate library calls?
IDK if there's anything like that, other than normal C compiler auto-vectorization of loops over arrays.
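In numpy specifically, the gathers can be written as fancy indexing and the rest as broadcasting; a sketch under the assumption that B and d hold valid integer indices into C (which is exactly what the comments above question):

```python
import numpy as np

N, M, K = 3, 4, 5
rng = np.random.default_rng(0)
B = rng.integers(0, N, size=(K, M))   # used both as a value and as an index
C = rng.random((K, N, N))             # indexed as C[d[k], i, B[k, j]]
d = rng.integers(0, K, size=K)
f = rng.random(M)

# A[i, j, k] = B[k, j] + C[d[k], i, B[k, j]] + d[k] * f[j]
i = np.arange(N)[:, None, None]       # shape (N, 1, 1)
j = np.arange(M)[None, :, None]       # shape (1, M, 1)
k = np.arange(K)[None, None, :]       # shape (1, 1, K)

A = B[k, j] + C[d[k], i, B[k, j]] + d[k] * f[j]   # broadcasts to (N, M, K)

# reference triple loop
A_ref = np.empty((N, M, K))
for ii in range(N):
    for jj in range(M):
        for kk in range(K):
            A_ref[ii, jj, kk] = B[kk, jj] + C[d[kk], ii, B[kk, jj]] + d[kk] * f[jj]
assert np.allclose(A, A_ref)
```

Whether numpy's gather is implemented with SIMD gather instructions under the hood is a separate question, but it at least removes the explicit Python loops.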

Automatically create test cases for web page?

If someone has a webpage, the usual way of testing the web site for user interaction bugs is to create each test case by hand and use selenium.
Is there a tool to create these test cases automatically? So that if I have a webpage that gets altered, new test cases get created automatically?
You can look at a paid product. That type of technology is not being developed as open source and will probably cost a bit. Some of the major test tools get closer to this, but I have not heard of anything fully automatic.
If this were the case, the role of QA Engineer, and especially Automation Engineer, would not be as important, and those jobs would decline pretty quickly. I would imagine that if such a tool were out there, it would be breaking news to the entire industry, worldwide.
If you go down the artificial intelligence path this is possible in theory and concept; however, artificial intelligence development efforts usually cost more than the app that needs the testing, so...that's not going to happen.
The best thing to do at this point is to separate out as much of the maintenance as possible into a single section so you limit the maintenance headache when modifying, and keep a core that stays the same. I usually keep control manipulation generic, and then the workflow, specific maps and data change. That will allow it to function against any website...but you still have to write/update the tests and maintain the maps.
I think Growing Test Cases Automatically is more of what you're asking. To be more specific, I'll try to introduce the basics, and if you're interested you can take a closer look at Evolutionary Testing.
Usually there is a standard set of constraints we meet, like changing functionality of the system under test (SUT), a limited timeframe, lack of appropriate test tools, and the list goes on… Yet there is another type of challenge which arises as technological solutions progress further – increasing system complexity.
While the typical constraints are solvable through different technical and management approaches, in the case of system complexity we are facing the limit of our ability to define a straightforward analytical method for assessing and validating system behavior. Complex systems consist of multiple, often heterogeneous components which, when working together, amplify each other's statistical and behavioral deviations, resulting in a system which acts in ways that were not part of its initial design. To make matters worse, complex systems become more sensitive to their environment through the same mechanism.
Options for testing complex systems
How can we test a system which behaves differently each time we run a test scenario? How can we reproduce a problem which costs days and millions to recover from, but happens only from time to time under conditions which are known just approximately?
One possible solution which I want to focus on is to embrace our lack of knowledge and work with whatever we have by using evolutionary testing. In this context the evolutionary testing can be viewed as a variant of black-box testing, because we are working with feeding input into and evaluating output from a SUT without focusing on its internal structure. The fine line here is that we are organizing this process of automatic test case generation and execution on a massive scale as an iterative optimization process which mimics the natural evolution.
Evolutionary testing
Elements:
• Population – the set of test case executions participating in the optimization process
• Generation – the part of the Population involved in a given iteration
• Individual – single test case execution and its results, an element from the Population
• Genome – unified definition of all test cases, model describing the Population
• Genotype – a single test case instance, a model describing an Individual, instance of the Genome
• Recombination – transformation of one or more Genotypes into a new Genotype
• Mutation – random change in a Genotype
• Fitness Function – formalized criterion, expressing the suitability of the Individual against the goal of the optimization
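To make the vocabulary concrete, here is a minimal sketch of how these elements might map onto data structures (the names and fields are illustrative only, not part of any particular framework):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Genotype:
    """A single test case instance, drawn from the Genome."""
    params: Dict[str, Any]                              # SUT inputs / environment parameters
    actions: List[str] = field(default_factory=list)    # action sequence to drive the SUT

@dataclass
class Individual:
    """One executed test case plus its observed results."""
    genotype: Genotype
    results: Dict[str, Any]
    fitness: float = 0.0

# The Genome is whatever generator produces valid Genotypes,
# and the Fitness Function scores an executed Individual.
FitnessFunction = Callable[[Individual], float]
```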
How do we create these elements?
• Definition of the experiment goal (selection criteria) – sets the direction of the optimization process and is related to the behavior of the SUT. Involves certain characteristics of SUT state or environment during the performed test case experiments. Examples:
o “SUT should complete the test case execution with an error code”
o “The test case should drive the SUT through the largest number of branches in SUT’s logical structure”
o “Ambient temperature in the room where the SUT is situated should not exceed 40 °C during test case execution”
o “CPU utilization on the system, where SUT runs should exceed 80% during test case execution”
Any measurable parameters of the SUT and its environment can be used in a goal statement. Knowledge of the relation between the test input and the goal itself is not obligatory. This makes it possible to cover goals which are derived directly from requirements, rather than from some later requirement derivative like a business, architectural or technical model.
• Definition of the relevant inputs and outputs of the tested system – identification of SUT inputs and outputs, as well as environment parameters, relevant to the experiment goal.
• Formal definition of the experiment genome – encoding the summarized set of test cases into a parameterized model (usually a data structure) expressing the relevant SUT input data, environment parameters and action sequences. This definition also needs to support the two major operations applied to genome instances – recombination and mutation. The mechanism for those two operations can be predefined for the type of data or action present in the genome, or have custom definitions.
• Formal definition of the selection criteria (fitness function) – an evaluation mechanism which takes the SUT output or environment parameters resulting from a test case execution (Individual) and calculates a number (Fitness), signifying how close this particular Individual is to the experiment goal.
How does the process work?
1. We use the Genome to create a Generation of random Genotypes (test case instances).
2. We execute the test cases (Genotypes), generating results (Individuals).
3. We evaluate each execution result (Individual) against our goal using the Fitness Function.
4. We select only those Individuals from the given Generation which have a Fitness above a given threshold (the top 10 %, above the average, etc.).
5. We use the selected Individuals to produce a new, full Generation set by applying Recombination and Mutation.
6. We repeat the process, returning to step 2.
The iteration process usually stops by setting a condition with regard to the evaluated Fitness of a Generation. For example:
• If the top Fitness hasn't changed by more than 0.1% since the last Iteration
• If the difference between the top and the bottom Fitness in a Generation is less than 0.3%
then probably it is time to stop.
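Putting the pieces together, the skeleton of such a loop might look like this (a generic sketch; random_genotype, execute, fitness, recombine and mutate are hypothetical callbacks supplied by your Genome and Fitness Function definitions):

```python
import random

def evolve(random_genotype, execute, fitness, recombine, mutate,
           pop_size=50, keep_ratio=0.1, max_iterations=100, eps=0.001):
    """Generic evolutionary-testing loop: generate, execute, score, select, breed."""
    generation = [random_genotype() for _ in range(pop_size)]        # step 1
    best_prev = None
    for _ in range(max_iterations):
        results = [(g, execute(g)) for g in generation]              # step 2
        scored = [(fitness(r), g) for g, r in results]               # step 3
        scored.sort(key=lambda s: s[0], reverse=True)
        survivors = [g for _, g in scored[:max(2, int(keep_ratio * pop_size))]]  # step 4
        best = scored[0][0]
        if best_prev is not None and abs(best - best_prev) <= eps:   # stopping condition
            break
        best_prev = best
        generation = [mutate(recombine(random.choice(survivors),     # step 5
                                       random.choice(survivors)))
                      for _ in range(pop_size)]                      # then repeat (step 6)
    return survivors
```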
Upsides and downsides
Upsides:
• We can work with limited knowledge of the SUT and with goal-oriented test definitions
• We use a test case model (Genome) which allows us to mass-produce a large number of test cases (Genotypes) with little effort
• We can “seed” test cases (Genotypes) in the first iteration instead of generating them at random in order to speed up the optimization process.
• We could run test cases in parallel in order to speed up the process
• We could find multiple solutions which meet our test goal
• If the optimization process is convergent, we have a guarantee that each following Generation is a better approximate solution of our test goal. This means that even if we need to stop before we have reached optimal Fitness, we will still have better test cases than the ones we started with.
• We can achieve replay of very complex, hard to reproduce test scenarios which mimic real life and which are far beyond the reach of any other automated or manual testing technique.
Downsides:
• The process of defining the necessary elements for evolutionary test implementation is non-trivial and requires specific knowledge.
• Implementing such automation approach is time- and resource-consuming and should be employed only when it is justifiable.
• The convergence of the optimization process depends on the smoothness of the Fitness Function. If its definition results in zones of discontinuity or of small/no gradient, then we can expect slow or no convergence.
Update:
I also recommend looking at Genetic algorithms; this article about Test data generation can also give you approaches and guidelines.
I happen to develop ecFeed - an open-source tool that may assist in test design. It's in the pre-release phase and we are going to add better integration with Selenium, but you may have a look at the current snapshot: https://github.com/testify-no/ecFeed/wiki . The next version should arrive in October and will have major improvements in usability. Anyway, I am looking forward to constructive criticism.
In the Microsoft development world there is Visual Studio's Coded UI Test framework. This will record your actions in a web browser and generate test cases to replicate that use case. It won't update test cases with any changes to the code, though; you would need to update them manually or regenerate them.

Software where equations are the principal objects of interest?

I'm working on a program where the main entities are mathematical equations.
Equations can only return a double or a boolean result.
The problems:
1- Lots of equations. (~300 per file)
2- There has to be a computations-log at the end, so every equation should somehow be able to log itself.
3- It has to be fast, as fast as possible, because those few hundred equations could be triggered a million times. (Think of it as a big optimization loop).
4- I want to enforce a certain order of appearance for the logged equations that is not necessarily the same as that of the code.
Currently, I'm using C (or C with a little bit of C++), and writing every equation as a function-like macro. I'm wondering, though, if this is the right way to go. Has this kind of problem been tackled before? Is there some other language that's better suited to this than C? And are there any design patterns or practices that I need to be aware of for this specific class of problems?
So the equations are turning into compiled code, not interpreted?
If so, that's the fastest.
However, if your logging involves I/O, that might be the biggest time-taker, which you can determine for sure by turning it off and comparing execution time.
If so, possible ways around it are:
generating output in binary, rather than formatting numbers (see the sketch after this list).
not writing any more than you have to, like events rather than long records.
trying not to be rotary-disk-bound, like writing to a memory file or a solid-state disk.
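As an illustration of the first point, each log entry can be written as a fixed-size binary record and formatted only when the log is inspected (a generic sketch, not tied to the original C macros; the record layout is made up):

```python
import struct

# Each record: equation id (uint32) + result (float64) = 12 bytes, little-endian.
RECORD = struct.Struct('<Id')

def log_equation(log_file, eq_id, value):
    """Append one binary record instead of formatting the number as text."""
    log_file.write(RECORD.pack(eq_id, value))

with open('computations.log', 'wb') as log_file:
    log_equation(log_file, 17, 3.14159)
    log_equation(log_file, 42, 2.71828)

# Later, when the computation log is actually read:
with open('computations.log', 'rb') as log_file:
    for eq_id, value in RECORD.iter_unpack(log_file.read()):
        print(eq_id, value)
```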

Do lex and yacc provide optimized code?

Do Lex and Yacc provide optimized code or is it required that we write our own code manually for higher performance?
The code you write has a substantial effect on the speed, especially on the lexer side. Some versions of Flex come with half a dozen (or so) different word counters, most written with Flex, and a few written by hand -- the code gives a pretty good idea of how to optimize scanning speed (and a fairly reasonable comparison of what you can expect in hand-written vs. machine generated lexers).
On the parser side, you're generally a bit more constrained -- you can't make as many changes without affecting the semantics. Here, a great deal depends on the parser generator you use -- for example, some algorithms for less constrained grammars consistently produce relatively slow parsers, while other algorithms only slow down for the less constrained constructs, and run relatively fast as long as the input doesn't use the more complex constructs.
Yacc produces a table-driven parser, which can never be as fast as a well-written hand-coded one. I don't know if the same applies to lex.