How to do Static analysis with ANTLR - antlr

I'm planning to build a static analyzer for a proprietary language, using ANTLR to build the AST. How does one go about checking the rules and guidelines that a project sets for that language, using the AST?
For example, if I build the AST for some C source code and want to check for null pointers, how would I do this check using the AST or CST?
Will I have to code the check myself on top of the ANTLR-generated lexer/parser?
Thanks

It depends on the specific analysis you want to perform. Taking your specific example: to determine statically whether a variable might be (or will be) a null pointer, you need to construct a data flow graph. I recommend studying the Dragon Book.
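To make the data flow idea concrete, here is a minimal sketch of a forward "nullness" analysis in Python. It does not use ANTLR: the control flow graph is written by hand, and the node names, statement kinds, and the three-value lattice are assumptions made for illustration. In a real tool, the graph would be built from the AST that your parser produces.

```python
from collections import deque

# Abstract values for a pointer variable (hypothetical lattice):
# NULL and NONNULL are definite facts, MAYBE means "either".
NULL, NONNULL, MAYBE = "null", "nonnull", "maybe"

def join(a, b):
    """Merge facts arriving from two predecessors (None = no fact yet)."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else MAYBE

# Hand-written CFG for this hypothetical C fragment:
#   p = NULL; if (c) p = &x; *p = 1;
# Each node maps to ((statement kind, variable), successor nodes).
CFG = {
    "entry":  (("assign_null", "p"), ["branch"]),
    "branch": (("nop", None), ["alloc", "use"]),
    "alloc":  (("assign_nonnull", "p"), ["use"]),
    "use":    (("deref", "p"), []),
}

def transfer(stmt, state):
    """Compute the facts that hold after executing one statement."""
    kind, var = stmt
    out = dict(state)
    if kind == "assign_null":
        out[var] = NULL
    elif kind == "assign_nonnull":
        out[var] = NONNULL
    return out

def analyze(cfg, entry="entry"):
    """Worklist iteration to a fixed point, then report risky dereferences."""
    in_state = {node: None for node in cfg}   # None = node not reached yet
    in_state[entry] = {}
    work = deque([entry])
    while work:
        node = work.popleft()
        stmt, successors = cfg[node]
        out = transfer(stmt, in_state[node])
        for succ in successors:
            old = in_state[succ]
            if old is None:
                new = dict(out)
            else:
                names = set(old) | set(out)
                new = {v: join(old.get(v), out.get(v)) for v in names}
            if new != old:
                in_state[succ] = new
                work.append(succ)
    warnings = []
    for node, (stmt, _) in cfg.items():
        kind, var = stmt
        state = in_state[node]
        if kind == "deref" and state is not None and state.get(var) in (NULL, MAYBE):
            warnings.append((node, var, state[var]))
    return warnings

print(analyze(CFG))
```

The dereference at node "use" is flagged because one path reaches it with p still null. The parser only gets you as far as the tree; the graph construction and the fixed-point iteration are the parts you have to supply yourself.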

If you want to check for null pointer values, you will need full control and data flow analysis for your proprietary language. ANTLR won't get you there without beyond-superhuman effort on your part.
Check out the flow analysis capabilities of our DMS. We have used this to do deep flow analysis of very large scale C programs.
Even using this machinery, you are going to have to do a lot of work to explain your language to it. This is just a lot easier than any other approach you might take.

Related

implement symbolic execution without model-checking

How can I implement symbolic execution for a particular language without using model checking or a finite state machine (FSM), that is, not in the style of Java PathFinder? I need details: for example, in what language can I implement the symbolic execution engine, and what else do I need to know?
You need:
A parser for the language to be symbolically executed that can build ASTs
Name resolution (and associated symbol tables), so when your execution engine encounters an identifier it can determine the associated type and value
Control flow analysis, so that the symbolic execution engine can follow flow of control through the program
A symbolic algebra that can compose and simplify symbolic terms. This needs a parser (so you can enter such equations) and a prettyprinter (so you can see what it computes)
A way to specify assumed values at the point of symbolic execution start
This is rather a lot of machinery, and it is hard to find it all in one place. It is harder to build it all just for one tool, which is part of the reason you don't find many tools like this.
Our DMS Software Reengineering Toolkit has all the requisites. You may find an example of a symbolic language implemented with DMS interesting.
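As a rough illustration of how the ingredients listed above fit together, here is a minimal Python sketch of a symbolic executor for a toy language. Everything in it is an assumption made for the example: programs are given directly as nested tuples (so the parser is skipped), the "symbolic algebra" only folds constants and drops identities, and path conditions are recorded as strings rather than passed to a solver.

```python
# Symbolic terms are ints, symbol names (str), or tuples (op, left, right).

def simplify(op, a, b):
    """Tiny 'symbolic algebra': fold constants and drop identities."""
    if isinstance(a, int) and isinstance(b, int):
        return {"+": a + b, "*": a * b, "<": int(a < b)}[op]
    if op == "+" and a == 0:
        return b
    if op == "+" and b == 0:
        return a
    if op == "*" and a == 1:
        return b
    if op == "*" and b == 1:
        return a
    return (op, a, b)

def evaluate(expr, env):
    """Evaluate an expression to a (possibly symbolic) term."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env[expr]              # name resolution via the environment
    op, left, right = expr
    return simplify(op, evaluate(left, env), evaluate(right, env))

def show(term):
    """Prettyprinter for symbolic terms."""
    if isinstance(term, tuple):
        op, a, b = term
        return f"({show(a)} {op} {show(b)})"
    return str(term)

def execute(stmts, env, path=()):
    """Yield (path_condition, final_env) for each explored path."""
    if not stmts:
        yield path, env
        return
    stmt, rest = stmts[0], stmts[1:]
    if stmt[0] == "assign":
        _, name, expr = stmt
        yield from execute(rest, {**env, name: evaluate(expr, env)}, path)
    elif stmt[0] == "if":
        _, cond, then_body, else_body = stmt
        c = evaluate(cond, env)
        if isinstance(c, int):        # condition is concrete: follow one branch
            yield from execute((then_body if c else else_body) + rest, env, path)
        else:                         # condition is symbolic: fork both ways
            yield from execute(then_body + rest, env, path + (show(c),))
            yield from execute(else_body + rest, env, path + (f"not {show(c)}",))

PROGRAM = [
    ("assign", "y", ("+", "x", 0)),
    ("if", ("<", "y", 10),
        [("assign", "z", ("*", "y", 2))],
        [("assign", "z", ("+", 1, 2))]),
]

# Assumed value at the start of execution: x holds the symbol X0.
results = list(execute(PROGRAM, {"x": "X0"}))
for condition, env in results:
    print(condition, {name: show(value) for name, value in env.items()})
```

The sketch has no loops, no types, and no feasibility check on path conditions, which are exactly the places where the real machinery becomes expensive to build.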

How to Write a Source to Source Compiler API

I am doing a little research on source-to-source compilation, and now that I am getting an understanding of it, I am wondering whether there are any examples of APIs for source-to-source compilers.
I mean an interface descriptor for passing the source code of one programming language to another compiler to be compiled. If so, can you point me to such examples, or give me tips (just pure explanation) on writing one? I am still at the research stage.
I should note that I have been researching this for several days and have come across things such as ROSE, DMS and LLVM. As I said, it's purely research, so I don't know what the best approach is. I know I wouldn't use ROSE, because it is only for C/C++. LLVM seems promising, but I am new to it. My aim is to create a transpiler supporting 4 languages (is that feasible?), which is why I need expert advice :)
Yes, you can have a procedural API for doing source-to-source translation. These are pretty straightforward in the abstract: define a core data structure to represent AST nodes, then define APIs to "parse file to AST", "visit tree nodes", "inspect tree nodes", "modify tree nodes", "spit out text". They get messy in the concrete, especially if the API is specific to the language being translated; too many of the details of that language get wound into the APIs. While traditional, this is really a rather clumsy way to define source-to-source translators, because you then have to write tons of procedural code invoking the APIs to do the translation.
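For a feel of what such a procedural API looks like in practice, here is a small sketch using Python's standard `ast` module (Python 3.9 or later for `ast.unparse`). It is only a stand-in: it rewrites Python to Python, and the `RenameCalls` class and `translate` function are names invented for this example. The shape, though, is the one described above: parse to an AST, visit and modify nodes procedurally, then regenerate text.

```python
import ast

class RenameCalls(ast.NodeTransformer):
    """Visit call nodes and rewrite calls to `old` into calls to `new`."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Call(self, node):
        self.generic_visit(node)                      # transform children first
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            replacement = ast.Name(id=self.new, ctx=ast.Load())
            node.func = ast.copy_location(replacement, node.func)
        return node

def translate(source, old, new):
    tree = ast.parse(source)                          # "parse file to AST"
    tree = RenameCalls(old, new).visit(tree)          # "visit/inspect/modify tree nodes"
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)                          # "spit out text"

result = translate("total = add(add(1, 2), x)", "add", "plus")
print(result)
```

Even this one-line transformation needs a class, a visitor method, and explicit node construction, which hints at how much procedural code a full translator written this way requires.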
You can instead define them using a program transformation system (PTS) using source to source transformations based on surface syntax; these are patterns written using the notation of your to-be-compiled language, and your target-language, in the form of "if you see this, then replace it by that", operating on syntax trees not text strings. This means you can inspect the transforms simply by staring at them. So can your fellow programmer.
One such translation rule might look like:
rule translate_add_to(t: access_path, u: access_path):COBOL -> Java
" add \t to \u "
-> " \object_for\(\u\).\u += \object_for\(\t\).\t; ";
with a left-hand side "add \t to \u " specifying a COBOL fragment (this) to be replaced by the right-hand side " \object_for... " representing the corresponding Java code (that). This rule uses a helper function "object_for" to decide where in the target Java program a global variable from the source COBOL program will be placed. (There's no avoiding writing such a function if you are translating COBOL to Java. You can argue about how sophisticated it needs to be.) In practice, the way such a rule works is that the pattern ASTs of each side are constructed, and then the patterns are matched against a parsed AST; a match causes the corresponding replacement subtree to be spliced into place where the match was found. (All this low-level tree matching and splicing has to be done... procedurally, but somebody else has already implemented that in a PTS.)
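The match-and-splice mechanics can be sketched in a few lines of Python. This is not how DMS implements it; it is a toy under stated assumptions: trees are nested tuples, pattern variables are strings starting with "?", the node names ("add_to", "plus_assign", "member") are invented, and each subtree is rewritten at most once, bottom-up.

```python
def match(pattern, tree, bindings):
    """Return extended bindings if the pattern matches the tree, else None."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:                       # repeated variable must agree
            return bindings if bindings[pattern] == tree else None
        return {**bindings, pattern: tree}
    if isinstance(pattern, tuple) and isinstance(tree, tuple) and len(pattern) == len(tree):
        for sub_pattern, sub_tree in zip(pattern, tree):
            bindings = match(sub_pattern, sub_tree, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == tree else None

def substitute(template, bindings):
    """Instantiate the right-hand side with the matched subtrees."""
    if isinstance(template, str) and template.startswith("?"):
        return bindings[template]
    if isinstance(template, tuple):
        return tuple(substitute(part, bindings) for part in template)
    return template

def rewrite(tree, rules):
    """Rewrite children first, then try each rule once at this node."""
    if isinstance(tree, tuple):
        tree = tuple(rewrite(child, rules) for child in tree)
    for lhs, rhs in rules:
        bindings = match(lhs, tree, {})
        if bindings is not None:
            return substitute(rhs, bindings)          # splice in the replacement
    return tree

# "if you see this, then replace it by that", written as trees.
RULES = [
    (("add_to", "?t", "?u"),
     ("plus_assign", ("member", "?u"), ("member", "?t"))),
]

cobol_tree = ("seq", ("add_to", "TAX", "TOTAL"), ("add_to", "FEE", "TOTAL"))
java_tree = rewrite(cobol_tree, RULES)
print(java_tree)
```

A real PTS lets you write the two sides in the surface syntax of the source and target languages and builds these pattern trees for you; the tuple notation here is what you would otherwise be writing by hand.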
In our experience, you need one to two thousand such rules to translate one language to another. The plethora of rules comes from the combinatorics of language syntax constructs for the source language (and their perhaps different interpretations according to types; "a+b" means different things when a is an int vs when a is a string) and the target language opportunities. A nice plus of such rewrites is that one can build a somewhat simpler base translation, and apply additional rewrites from the target language to itself to clean up and optimize the translated result.
Many PTS are purely based on source-to-source surface syntax rewrites. We have found that combining both a PTS and a procedural API, and making it possible to segue between them, makes for a very nice tool: you can use the rewrites where convenient, and procedural APIs where rewrites don't work so well (the "object_for" function suggested above is easier to code as a procedure).
See a lot more detail on how our DMS Software Reengineering Toolkit encodes such transformation rules (the one above is coded in DMS style) in a language-agnostic (well, parameterized) fashion. DMS offers a "pure" procedural API as OP requested, with some 400 functions, but DMS encourages its users to lean heavily on the rewrites and to code only as little as necessary against the procedural API. It would be "straightforward" (at least as straightforward as practical) to build your "4 language support" this way.
Don't underestimate the amount of effort needed to build such translators, even with a lot of good technical machinery as a foundation. Languages tend to be complex beasts, and their translations doubly so. And you have to decide if you want a truly crummy translation or a good one.
I have been using ROSE compiler framework to write a source to source translator. ROSE can parse a language that it supports and create an AST from it. It provides different APIs (found in SageInterface) to perform transformation and analysis on the AST. After the transformation, you can unparse the transformed AST to produce your target source code.
If ROSE does not support parsing your input language, you can write your own parser while utilizing ROSE's SageBuilder API to build the AST. If your target language is one of the languages that ROSE supports, then you can rely on ROSE's unparser to get the target code. But if ROSE does not support your target language, then you can write your own unparser as well, using the different AST traversal mechanisms provided by ROSE.

Code auto-generation and auto-tuning tools or language for C program?

I want to use some tools (free is better) or languages to help me do the following two tasks:
Task 1:
1. Read the specification file (text file) the user gives as input. The format of the specification file is designed by me, and the user must follow it.
2. Use the specification input to generate an AST (abstract syntax tree).
3. Transform the AST into another AST by applying some optimization techniques such as loop optimization, blocking or any other optimization I want. (Optional step)
4. Export the transformed AST to a source code file (C program file).
Task 2:
1. Read a source code file (C program file) and generate an AST to represent it.
2. Transform the AST into another AST by applying some optimization techniques such as loop optimization, blocking or any other optimization I want. (I want to be able to parameterize some optimizations, for example by the loop unroll depth.)
3. Export the transformed AST to another optimized source file (C program file).
What OP wants in general is a program transformation system (PTS). PTS are generally capable of accepting an arbitrary syntax specification, building a parser producing ASTs from that syntax, applying source-to-source transforms to map the parsed AST to other ASTs, and then regenerating source text from the final AST.
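To illustrate the "transform the AST, then regenerate source text" half of that pipeline, here is a minimal Python sketch of a parameterized loop unrolling transform that emits C. All of it is assumed for the example: the `Loop` node type, the "$i" placeholder for the index expression in body statements, and the restriction to constant bounds with unit step. Parsing real C is the hard part and is not attempted here.

```python
from dataclasses import dataclass

@dataclass
class Loop:
    """Hypothetical AST node for `for (var = start; var < stop; var += step)`."""
    var: str
    start: int
    stop: int      # exclusive upper bound
    step: int
    body: list     # C statements, with "$i" standing for the index expression

def unroll(loop, factor):
    """Split a unit-step loop into an unrolled main loop and a remainder loop."""
    assert loop.step == 1 and factor >= 1
    trips = max(0, loop.stop - loop.start)
    main_stop = loop.start + (trips // factor) * factor
    body = [
        stmt.replace("$i", f"($i + {k})") if k else stmt
        for k in range(factor)
        for stmt in loop.body
    ]
    main = Loop(loop.var, loop.start, main_stop, factor, body)
    rest = Loop(loop.var, main_stop, loop.stop, 1, list(loop.body))
    return main, rest

def emit_c(loop, indent="    "):
    """Regenerate C source text from a Loop node."""
    v = loop.var
    lines = [f"for (int {v} = {loop.start}; {v} < {loop.stop}; {v} += {loop.step}) {{"]
    lines += [indent + stmt.replace("$i", v) for stmt in loop.body]
    lines.append("}")
    return "\n".join(lines)

original = Loop("i", 0, 10, 1, ["a[$i] = b[$i] * 2;"])
main, rest = unroll(original, factor=4)   # the unroll depth is the tuning parameter
print(emit_c(main))
print(emit_c(rest))
```

The unroll factor is an ordinary function argument, so an auto-tuner could generate several variants and time each one. Deciding whether the transform is legal for an arbitrary C loop needs the symbol tables and flow analyses discussed in the rest of this answer.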
A specific issue for OP is parsing/unparsing C source code. Almost none of the PTSs available do this for production C code (ANSI, GCC, MSStudio) and it is quite a lot of work to get this right. Nor do they provide auxiliary analyses which are needed to do interesting transformations, such as symbol tables, control or data flow analysis.
To my knowledge, only our DMS Software Reengineering Toolkit, and Rose Compiler, have specific support like this for C.
Rose, however, isn't designed to accept a DSL; it violates the PTS model by not allowing arbitrary syntax definitions. Instead, it uses the EDG parser front end (I think this means it also accepts C++14). But it can't handle OP's first request easily. Rose also does "source-to-source" transformations, but does so by hand-written procedural code that crawls the AST. It is focused on scientific computing, so they have done specific work on blocking loops, etc.
DMS is designed to accept arbitrary grammars (and handles C as well as C++14), and in fact can handle more than one at the same time, so it will support OP's first task directly. DMS does surface-syntax (written using C syntax directly) source-to-source rewrites as well as procedural ones. It has not been used for loop blocking, but DMS has been used to build vector extensions of C++ with code generation for SIMD instructions including appropriate loop optimizations.
The POET (Parameterized Optimization for Empirical Tuning, http://www.cs.uccs.edu/~qyi/poet) script language is one candidate. Are there any other tools or languages?

Why do tools like yacc and ANTLR generate source code?

These tools basically input a grammar and output code which processes a series of tokens into something more useful, like a syntax tree. But could these tools be written in the form of a library instead? What is the reason for generating source code as output? Is there a performance gain? Is it more flexible for the end user? Easier to implement for the authors of yacc and ANTLR?
Sorry if the question is too vague, I'm just curious about the historical reasons behind the decisions the authors made, and what purpose auto-generated code has in today's environment.
There's a big performance advantage achieved by the parser generator working out the interactions of the grammar rules with respect to one another, and compiling the result to code.
One could build interpreters that simply accepted grammars and did the parsing; there are parser types (Earley) that would actually be relatively good at that, and one could compute the grammar interactions at runtime (Earley parsers kind of do this anyway) rather than offline and then execute the parsing algorithm.
But you would pay a parsing performance penalty of 10 to 100x slowdown, and probably a big storage demand.
If you are parsing using only very small grammars, or you are parsing only very small documents, this might not matter. But the grammars that many parser generators get applied to end up being fairly big (people keep wanting to add things to what you can say in a language), and they often end up processing pretty big documents. So performance now matters, and voilà, people build code-generating parser generators.
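To show what "compiling the grammar to code" means, here is a toy generator in Python. It is far simpler than yacc or ANTLR and makes its own assumptions: the grammar is a dict of ordered alternatives, the emitted code is a recursive-descent recognizer with PEG-style ordered choice (no left recursion, no table construction), and the function names `generate` and `recognizes` are invented. The point is only that the grammar is consulted once, at generation time, and the emitted source contains no grammar data structure at all.

```python
# Symbols that are keys of the grammar are nonterminals; anything else is a token.
GRAMMAR = {
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["(", "expr", ")"], ["n"]],
}

def generate(grammar):
    """Emit Python source with one parse_<rule>(tokens, pos) function per rule."""
    out = []
    for rule, alternatives in grammar.items():
        out.append(f"def parse_{rule}(tokens, pos):")
        for alt in alternatives:
            out.append("    p = pos")
            out.append("    while True:  # runs once; 'break' abandons this alternative")
            for sym in alt:
                if sym in grammar:
                    out.append(f"        p = parse_{sym}(tokens, p)")
                    out.append("        if p is None: break")
                else:
                    out.append(f"        if p >= len(tokens) or tokens[p] != {sym!r}: break")
                    out.append("        p += 1")
            out.append("        return p")
        out.append("    return None")
        out.append("")
    return "\n".join(out)

source = generate(GRAMMAR)        # done once, "offline"
namespace = {}
exec(source, namespace)           # stands in for compiling the generated file

def recognizes(tokens):
    """True if the generated parser consumes the whole token list."""
    return namespace["parse_expr"](tokens, 0) == len(tokens)

print(source)
print(recognizes(list("n+(n+n)")), recognizes(list("n+")))
```

An interpreter for the same grammar would walk the `GRAMMAR` dict on every parse; the generated functions have that walk baked in as straight-line code.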
Once you have a tool, it is often easier to use even in simple cases. So now that you have parser generators, you can even apply them to little grammars or to parsing little documents.
EDIT: Addendum. The historical reason is probably driven by space and time demands. Earlier systems did not have a lot of room (32 KB in 1975), didn't run very fast (1 MIPS in the same time frame), and people had big source files already. Parser generators tended to help with this set of problems; interpreted grammars would have had intolerably bad performance.
Ira Baxter gave you one set of reasons for not handling the grammar parsing at runtime.
There is another reason too. Associated with each rule in the grammar is the appropriate action. The action is normally a fragment of a separate language (for example, C or C++). All actions in a grammar interpreted at runtime would have to be mappable to something appropriate in the program. In general, that's a losing proposition. The fragments can do all sorts of things, referencing parts of the stack ($$, $1, etc.) and invoking actions (YYACCEPT, etc.). Designing the runtime system so that it could be reliably used with such fragments would be tough. You'd likely end up generating source code, compiling it into a DSO (dynamic shared object) or DLL (dynamic link library), and loading it. That requires a compiler on the customer's machine, where the customer may have deliberately designed their production system to be compiler-free.

Creating a simple Domain Specific Language

I am curious to learn about creating a domain specific language. For now the domain is quite basic, just have some variables and run some loops, if statements.
Edit: The language will be non-English based with a very simple syntax.
I am thinking of targeting the Java Virtual Machine, i.e., compiling to Java bytecode.
Currently I know how to write some simple grammars using ANTLR.
I know that ANTLR creates a lexer and parser but how do I go forward from here?
About semantic analysis: does it have to be manually written or are there some tools to create it?
How can the output from the lexer and parser be converted to Java bytecode?
I know that there are libraries like ASM or BCEL but what is the exact procedure?
Are there any frameworks for doing this? And if there are, what is the simplest one?
You should try Xtext, an Eclipse-based DSL toolkit. Version 2 is quite powerful and stable. From its home page you have plenty of resources to get you started, including some video tutorials. Because the Eclipse ecosystem runs around Java, it seems the best choice for you.
You can also try MPS, but this is a projectional editor, and beginners may find it more difficult. It is nevertheless not less powerful than Xtext.
If your goal is to learn as much as possible about compilers, then indeed you have to go the hard way: write an ad hoc parser (no ANTLR and the like), write your own semantic passes and your own code generation.
Otherwise, you'd better extend an existing extensible language with your DSL, reusing its parser, its semantics and its code generation functionality. For example, you can easily implement an almost arbitrarily complex DSL on top of Clojure macros (and Clojure itself is then compiled to JVM bytecode, so you get that for free).
A DSL with simple syntax may or may not mean simple semantics.
Simple semantics may or may not mean easy translation to a target language; such translations are "technically easy" only if the DSL and the target language share a lot of common data types and execution models. (Constraint systems have simple semantics, but translating them to Fortran is really hard!) (You gotta wonder: if translating your DSL is easy, why do you have it?)
If you want to build a DSL (in your case you stick with easy because you are learning), you want DSL compiler infrastructure that has whatever you need in it, including support for difficult translations. "What is needed" to handle translating all DSLs to all possible target languages is clearly an impossibly large set of machinery.
However, there is a lot of machinery that is clearly helpful:
Strong parsing machinery (who wants to diddle with grammars whose structure is forced by the weakness of the parsing machinery? If you don't know what this means, go read about LL(1) grammars as an example)
Automatic construction of a representation (e.g., an abstract syntax tree) of the parsed DSL
Ability to access/modify/build new ASTs
Ability to capture information about symbols and their meaning (symbol tables)
Ability to build analyses of the AST for the DSL, to support translations that require information from "far away" in the tree to influence the translation at a particular point in the tree
Ability to reorganize the AST easily to achieve local optimizations
Ability to construct/analyze control and dataflow information if the DSL has some procedural aspects, and the code generation requires deep reasoning or optimization
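A few of the items above (an AST, a symbol table, a semantic check, and code generation) can be sketched in Python for a DSL that has only assignments and arithmetic. The sketch rests on assumptions: the AST is written directly as tuples rather than produced by a parser, the target is an invented stack machine standing in for JVM bytecode, and `run` is a small interpreter included only to check the generated code.

```python
# A program is a list of (variable, expression) assignments.
# An expression is an int, a variable name, or a tuple (op, left, right).
PROGRAM = [
    ("x", 3),
    ("y", ("+", "x", ("*", 2, 5))),
]

def check(program):
    """Semantic pass: build a symbol table, reject use before definition."""
    symbols = set()

    def check_expr(expr):
        if isinstance(expr, str) and expr not in symbols:
            raise NameError(f"undefined variable: {expr}")
        if isinstance(expr, tuple):
            _, left, right = expr
            check_expr(left)
            check_expr(right)

    for name, expr in program:
        check_expr(expr)
        symbols.add(name)
    return symbols

def compile_expr(expr, code):
    """Post-order walk of the AST, emitting stack-machine instructions."""
    if isinstance(expr, int):
        code.append(("push", expr))
    elif isinstance(expr, str):
        code.append(("load", expr))
    else:
        op, left, right = expr
        compile_expr(left, code)
        compile_expr(right, code)
        code.append(("add",) if op == "+" else ("mul",))

def compile_program(program):
    check(program)
    code = []
    for name, expr in program:
        compile_expr(expr, code)
        code.append(("store", name))
    return code

def run(code):
    """Interpret the generated instructions to check what they compute."""
    stack, memory = [], {}
    for op, *args in code:
        if op == "push":
            stack.append(args[0])
        elif op == "load":
            stack.append(memory[args[0]])
        elif op == "store":
            memory[args[0]] = stack.pop()
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(left + right if op == "add" else left * right)
    return memory

code = compile_program(PROGRAM)
print(code)
print(run(code))
```

Since JVM bytecode is also stack-based, the post-order walk in `compile_expr` has the same shape as expression code generation for the JVM. Loops, conditionals, and types are where the control and dataflow machinery from the list starts to matter.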
Most of the tools available for "building DSL generators" provide some kind of parsing, perhaps tree building, and then leave you to fill in all the rest. This puts you in the position of having a small, clean DSL but taking forever to implement it. That's not good. You really want all that infrastructure.
Our DMS Software Reengineering Toolkit has all the infrastructure sketched above and more. (It clearly doesn't, and can't, have the moon.) You can see a complete, all-in-one-"page", simple DSL example that exercises some interesting parts of this machinery.