Generating random but still valid expressions based on yacc/bison/ANTLR grammar

Generating random but still valid expressions based on yacc/bison/ANTLR grammar - antlr

Is it possible? Any tool available for this?

You can do this with any system that gives you access to base grammar. ANTLR and YACC compile your grammar away so you don't have them anymore. In ANTLR's case, the grammar has been turned into code; you're not going to get it back. In YACC's case, you end up with parser tables, which contain the essence of the grammar; you could walk such parse tables if you understood them well enough to do what I describe below as.
It is easy enough to traverse a set of explicitly represented grammar rules and randomly choose expansions/derivations. By definition this will get you valid syntax.
What it won't do is get you valid code. The problem here is that most languages really have context sensitive syntax; most programs aren't valid unless the declared identifiers are used in a way consistent with their declaration and scoping rules. That latter requires a full semantic check.
Our DMS Software Reengineering Toolkit is used to parse code in arbitrary languages [using a grammar], build ASTs, lets you analyze and transform those trees, and finally prettyprint valid (syntactic) text. DMS provides direct access to the grammar rules, and tree building facilities, so it is pretty easy to generate random syntactic trees (and prettyprint). Making sure they are semantically valid is hard with DMS too; however, many of DMS's front ends can take a (random) tree and do semantic checking, so at least you'd know if the tree was semantically valid.
What you do if it says "no" is still an issue. Perhaps you can generate identifier names in way that guarantees at least not-inconsistent usage, but I suspect that would be langauge-dependent.

yacc and bison turn your grammar into a finite state machine. You should be able to traverse the state machine randomly to find valid inputs.
Basically, at each state you can either shift a new token on to the stack and move to a new state or reduce the top token in the stack based on a set of valid reductions. (See the Bison manual for details about how this works).
Your random generator will traverse the state machine making random but valid shifts or reductions at each state. Once you reach the terminal state you have a valid input.
For a human readable description of the states you can use the -v or --report=state option to bison.
I'm afraid I can't point you to any existing tools that can do this.

Related

ANTRL4: Extracting expressions from languages

I have a programming language that has many constructs in it however I am only interested in extracting expressions from the language.
Is that possible to do without having to write the entire grammar?

Yes, its possible. You want what is called an "island parser". https://en.wikipedia.org/wiki/Island_grammar. You might not actually
decide to do this, more below.
The essential idea is to provide detailed grammar rules for the part of the language ("islands") you care about, and sloppy rules for the rest ("water").
The detailed grammar rules... you write as would normally write them. This includes building a lexer and parser to parse the part you want.
The "water" part is implemented as much as you can by defining sloppy lexemes. You may need more than one, and you will likely have to handle nested structures e.g., things involving "("...")" "["..."] and "{" ... "}" and you will end up doing with explicit tokens for the boundaries of these structures, and recursive grammar rules that keep track of the nesting (because lexers being FSAs typically can't track this).
Not obvious when you start, but painfully obvious after you are deep into this mess is skipping over long comment bodies, and especially string literals with the various quotes allowed by the language (consider Python for an over the top set) and the escaped sequences inside. You'll get burned by languages that allow interpolated strings when you figure out that you have the lex the raw string content separately from the interpolated expressions, because these are also typically nested structures. PHP and C# allow arbitrary expressions in their interpolated strings.... including expressions which themselves can contain... more interpolated strings!
The upside is all of this isn't really hard technically, if you ignore the sweat labor to dream up and handle all the funny cases.
... but ... considering typical parsing goals, island grammars tend to fall apart when used for this purpose.
To process expressions, you usually need the language declarations that provide types for the identifiers. If you leave them in the "ocean" part... you don't get type declarations and now it is hard to reason about your expressions. If you are processing java, and you encounter (a+b), is that addition or string concatenation? Without type information you just don't know.
If you decide you need the type information, now you need the detailed grammar for the variable and type declarations. And suddenly you're a lot closer to a full parser. At some point, you bail and just build a full parser; then you don't have think about whether you've cheated properly.

You don’t mention your language, but there’s a good chance that there’s an ANTLR grammar for it here ANTLR Grammars
These grammars will parse the entire contents of the source (by doing this, you can avoid some “messiness” that can come with trying decide when to pop into, and out of, island grammars, which could be particularly messy for expressions since they can occur in so many places within a typical source file.)
Once you have the resulting ParseTree, ANTLR provides a Listener capability that allows you to call a method to walk the tree and call you back for only those parts you are interested in. In your case that would be expressions.
A quick search on ANTLR Listeners should turn up several resources on how to write a Listener for your needs. (This is a pretty short article that covers the basics (in this case, for when you’re only interested in methods, but expressions would be similar. There are certainly others).

Reorder token stream in XText lexer

I am trying to lex/parse an existing language by using XText. I have already decided I will use a custom ANTLRv3 lexer, as the language cannot be lexed in a context-free way. Fortunately, I do not need parser information; just the previously encountered tokens is enough to decide the lexing mode.
The target language has an InputSection that can be described as follows: InputSection: INPUT_SECTION A=ID B=ID;. However, it can be specified in two different ways.
; The canonical way
$InputSection Foo Bar
$SomeOtherSection Fonzie
; The haphazard way
$InputSection Foo
$SomeOtherSection Fonzie
$InputSection Bar
Could I use TokenStreamRewriter to reorder all tokens in the canonical way, before passing this on to the parser? Or will this generate issues in XText later?

After a lot of investigation, I have come to the conclusion that editor tools themselves are truly not fit for this type of problem.
If you would start typing on one rule, you would have to take into account the AST context from subsequent sections to know how to auto-complete. At the same time, this will be very confusing for the user.
In the end, I will therefore simply not support this obscure feature of the language. Instead, the AST will be constructed so that a section (reasonably) divided between two parts will still parse correctly.

Is there a way to generate unit test to test my grammar

I created my grammar using antlr4 but I want to test robustess
is there an automatic tool or a good way to do that fast
Thanks :)

As it's so hard to find real unit tests for ANTLR, I wrote 2 articles about it:
Unit test for Lexer
Unit test for Parser
A Lexer test checks whether a given text is read and converted to the expected Token sequences. It's useful for instance to avoid ambiguity errors.
A Parser test take a sequence of tokens (that is, it starts after the lesser part) and checks whether that token sequence traverses the expected rules ( java methods).

The only way I found to create unit tests for a grammar is to create a number of examples from a written spec of the given language. This is neither fast, nor complete, but I see no other way.
You could be tempted to create test cases directly from the grammar (writing a tool for that isn't that hard). But think a moment about this. What would you test then? Your unit tests would always succeed, unless you use generated test cases from an earlier version of the grammar.
A special case is when you write a grammar for a language that has already a grammar for another parser generation tool. In that case you could use the original grammar to generate test cases which you then can use to test your new grammar for conformity.
However, I don't know any tool that can generate the test cases for you.
Update
Meanwhile I got another idea that would allow for better testing: have a sentence generator that generates random sentences from your grammar (I'm currently working on one in my Visual Studio Code ANTLR4 extension). The produced sentences can then be examined using a heuristic approach, for their validity:
Confirm the base structure.
Check for mandatory keywords and their correct order.
Check that identifiers and strings are valid.
Watch out for unusual constructs that are not valid according to language.
...
This would already cover a good part of the language, but has limits. Matching code and generating it are not 1:1 operations. A grammar rule that matches certain (valid) input might generate much more than that (and can so produce invalid input).

In one chapter of his book 'Software Testing Techniques' Boris Beizer addresses the topic of 'syntax testing'. The basic idea is to (mentally or actually) take a grammar and represent it as a syntax diagram (aka railroad diagram). For systematic testing, this graph then would be covered: Good cases where the input matches the elements, but also bad cases for each node. Iterations and recursive calls would be handled like loops, that is, cases with zero, one, two, one less than max, max, once above max iterations (i.e. occurrences of the respective syntactic element).

ANTLR: Source to Target Language Conversion

I have fair understanding on ANTLR & grammar. Is it correct to say ANTLR can do source language to target language conversion like ASP to JSP or COBOL to JSP? if yes, could you help me to provide some information/tutorial/link to explorer the possibilities?
Idea is to pragmatically translating huge amounts of code from source to target using ANTLR.
Thanks

The basic steps to building a translator in Antlr4 is to:
generate a parse tree from an input text in the source language
repeatedly walk the parse tree to analyze the nodes of the parse tree, adding and evolving properties (decorator pattern) associated with individual parse tree nodes -- the properties will describe the change(s) required to represent the content of the node in the target language.
final walk of the parse tree to collect and output the target language text.
The form and content of the properties and the progression of creation and evolution will be entirely dependent on the nature of the source and target languages and the architect's conversion strategy.
Since Antlr parse-tree walks can be logically independent of one another, specific conversion aspects can be addressed in separate walks. For example, one walk can evaluate (possibly among other things) whether individual perform until statements will be converted to if or while statements. Another walk can be dedicated to analyzing variable names to ensure they are created/accessed in the correct scope and determining the naming and scope of any target language required temporary variables. Etc.
Given that the conversion is a one-time affair, there is no fundamental penalty to implementing 5, 10, or even more walks. Just the 'whatever makes sense in your case' practicality.
The (relevant) caveat addressed in the other QA is how to handle conversions where there is no simple or near identity between statements in the two languages. To convert a unique source language statement then requires a target language run-time package be created to implement the corresponding function.
GenPackage (I am the author) automates the generation of a basic conversion project. The generated project represents but one possible architectural approach and leaves substantial work to be done to tailor it to any particular end use.

removing dead variables using antlr

I am currently servicing an old VBA (visual basic for applications) application. I've got a legacy tool which analyzes that application and prints out dead variables. As there are more than 2000 of them I do not want to do this by hand.
Therefore I had the idea to transform the separate codefiles which contain the dead variable according to the aforementioned tool to ASTs and remove them that way.
My question: Is there a recommended way to do this?
I do not want to use StringTemplate, as I would need to create templates for all rules and if I had a commend on the hidden channel, it would be lost, right?
Everything I need is to remove parts of that code and print out the rest as it was read in.
Any one has any recommendations, please?

Some theory
I assume that regular expressions are not enough to solve your task. That is you can't define the notion of a dead-code section in any regular language and expect to express it in a context-free language described by some antlr grammar.
The algorithm
The following algorithm can be suggested:
Tokenize source code with a lexer.
Since you want to preserve all the correct code -- don't skip or hide it's tokens. Make sure to define separate tokens for parts which may be removed or which will be used to determine the dead code, all other characters can be collected under a single token type. Here you can use output of your auxiliary tool in predicates to reduce the number of tokens generated. I guess antlr's tokenization (like any other tokenization) is expressed in a regular language so you can't remove all the dead code on this step.
Construct AST with a parser.
Here all the powers of a context-free language can be applied -- define dead-code sections in parser's rules and remove it from the AST being constructed.
Convert AST to source code. You can use some tree parser here, but I guess there is an easier way which can be found observing toString and similar methods of a tree type returned by the parser.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas