Reorder token stream in XText lexer - antlr

I am trying to lex/parse an existing language by using XText. I have already decided I will use a custom ANTLRv3 lexer, as the language cannot be lexed in a context-free way. Fortunately, I do not need parser information; just the previously encountered tokens is enough to decide the lexing mode.
The target language has an InputSection that can be described as follows: InputSection: INPUT_SECTION A=ID B=ID;. However, it can be specified in two different ways.
; The canonical way
$InputSection Foo Bar
$SomeOtherSection Fonzie
; The haphazard way
$InputSection Foo
$SomeOtherSection Fonzie
$InputSection Bar
Could I use TokenStreamRewriter to reorder all tokens in the canonical way, before passing this on to the parser? Or will this generate issues in XText later?

After a lot of investigation, I have come to the conclusion that editor tools themselves are truly not fit for this type of problem.
If you would start typing on one rule, you would have to take into account the AST context from subsequent sections to know how to auto-complete. At the same time, this will be very confusing for the user.
In the end, I will therefore simply not support this obscure feature of the language. Instead, the AST will be constructed so that a section (reasonably) divided between two parts will still parse correctly.

Related

ANTRL4: Extracting expressions from languages

I have a programming language that has many constructs in it however I am only interested in extracting expressions from the language.
Is that possible to do without having to write the entire grammar?
Yes, its possible. You want what is called an "island parser". https://en.wikipedia.org/wiki/Island_grammar. You might not actually
decide to do this, more below.
The essential idea is to provide detailed grammar rules for the part of the language ("islands") you care about, and sloppy rules for the rest ("water").
The detailed grammar rules... you write as would normally write them. This includes building a lexer and parser to parse the part you want.
The "water" part is implemented as much as you can by defining sloppy lexemes. You may need more than one, and you will likely have to handle nested structures e.g., things involving "("...")" "["..."] and "{" ... "}" and you will end up doing with explicit tokens for the boundaries of these structures, and recursive grammar rules that keep track of the nesting (because lexers being FSAs typically can't track this).
Not obvious when you start, but painfully obvious after you are deep into this mess is skipping over long comment bodies, and especially string literals with the various quotes allowed by the language (consider Python for an over the top set) and the escaped sequences inside. You'll get burned by languages that allow interpolated strings when you figure out that you have the lex the raw string content separately from the interpolated expressions, because these are also typically nested structures. PHP and C# allow arbitrary expressions in their interpolated strings.... including expressions which themselves can contain... more interpolated strings!
The upside is all of this isn't really hard technically, if you ignore the sweat labor to dream up and handle all the funny cases.
... but ... considering typical parsing goals, island grammars tend to fall apart when used for this purpose.
To process expressions, you usually need the language declarations that provide types for the identifiers. If you leave them in the "ocean" part... you don't get type declarations and now it is hard to reason about your expressions. If you are processing java, and you encounter (a+b), is that addition or string concatenation? Without type information you just don't know.
If you decide you need the type information, now you need the detailed grammar for the variable and type declarations. And suddenly you're a lot closer to a full parser. At some point, you bail and just build a full parser; then you don't have think about whether you've cheated properly.
You don’t mention your language, but there’s a good chance that there’s an ANTLR grammar for it here ANTLR Grammars
These grammars will parse the entire contents of the source (by doing this, you can avoid some “messiness” that can come with trying decide when to pop into, and out of, island grammars, which could be particularly messy for expressions since they can occur in so many places within a typical source file.)
Once you have the resulting ParseTree, ANTLR provides a Listener capability that allows you to call a method to walk the tree and call you back for only those parts you are interested in. In your case that would be expressions.
A quick search on ANTLR Listeners should turn up several resources on how to write a Listener for your needs. (This is a pretty short article that covers the basics (in this case, for when you’re only interested in methods, but expressions would be similar. There are certainly others).

Is there a way to generate unit test to test my grammar

I created my grammar using antlr4 but I want to test robustess
is there an automatic tool or a good way to do that fast
Thanks :)
As it's so hard to find real unit tests for ANTLR, I wrote 2 articles about it:
Unit test for Lexer
Unit test for Parser
A Lexer test checks whether a given text is read and converted to the expected Token sequences. It's useful for instance to avoid ambiguity errors.
A Parser test take a sequence of tokens (that is, it starts after the lesser part) and checks whether that token sequence traverses the expected rules ( java methods).
The only way I found to create unit tests for a grammar is to create a number of examples from a written spec of the given language. This is neither fast, nor complete, but I see no other way.
You could be tempted to create test cases directly from the grammar (writing a tool for that isn't that hard). But think a moment about this. What would you test then? Your unit tests would always succeed, unless you use generated test cases from an earlier version of the grammar.
A special case is when you write a grammar for a language that has already a grammar for another parser generation tool. In that case you could use the original grammar to generate test cases which you then can use to test your new grammar for conformity.
However, I don't know any tool that can generate the test cases for you.
Update
Meanwhile I got another idea that would allow for better testing: have a sentence generator that generates random sentences from your grammar (I'm currently working on one in my Visual Studio Code ANTLR4 extension). The produced sentences can then be examined using a heuristic approach, for their validity:
Confirm the base structure.
Check for mandatory keywords and their correct order.
Check that identifiers and strings are valid.
Watch out for unusual constructs that are not valid according to language.
...
This would already cover a good part of the language, but has limits. Matching code and generating it are not 1:1 operations. A grammar rule that matches certain (valid) input might generate much more than that (and can so produce invalid input).
In one chapter of his book 'Software Testing Techniques' Boris Beizer addresses the topic of 'syntax testing'. The basic idea is to (mentally or actually) take a grammar and represent it as a syntax diagram (aka railroad diagram). For systematic testing, this graph then would be covered: Good cases where the input matches the elements, but also bad cases for each node. Iterations and recursive calls would be handled like loops, that is, cases with zero, one, two, one less than max, max, once above max iterations (i.e. occurrences of the respective syntactic element).

General stategy for designing Flexible Language application using ANTLR4

Requirement:
I am trying to develop a language application using antlr4. The language in question is not important. The important thing is that the grammar is very vast (easily >2000 rules!!!). I want to do a number of operations
Extract bunch of informations. These can be call graphs, variable names. constant expressions etc.
Any number of transformations:
if a loop can be expanded, we go ahead and expand it
If we can eliminate dead code we might choose to do that
we might choose to rename all variable names to conform to some norms.
Each of these operations can be applied independent of each other. And after application of these steps I want the rewrite the input as close as possible to the original input.
e.g. So we might want to eliminate loops and rename the variable and then output the result in the original language format.
Questions:
I see a need to build a custom Tree (read AST) for this. So that I can modify the tree with each of the transformations. However when I want to generate the output, I lose the nice abilities of the TokenStreamRewriter. I have to specify how to write each of the nodes of the tree and I lose the original input formatting for the places I didn't do any transformations. Does antlr4 provide a good way to get around this problem?
Is AST the best way to go? Or do I build my own object representation? If so how do I create that object efficiently? Creating object representation is very big pain for such a vast language. But may be better in the long run. Again how do I get back the original formatting?
Is it possible to work just on the parse tree?
Are there similar language applications which do the same thing? If so what strategy do they use?
Any input is welcome.
Thanks in advance.
In general, what you want is called a Program Transformation System (PTS).
PTSs generally have parsers, build ASTs, can prettyprint the ASTs to recover compilable source text. More importantly, they have standard ways to navigate/inspect/modify the ASTs so that you can change them programmatically.
Many offer these capabilities in the form of pattern-matching code fragments written in the surface syntax of the language being transformed; this avoids the need to forever having to know excruciatingly fine details about which nodes are in your AST and how they are related to children. This is incredibly useful when you big complex grammars, as most of our modern (and our legacy languages) all seem to have.
More sophisticated PTSs (very few) provide additional facilities for teasing out the semantics of the source code. It is pretty hard to analyze/transform most code without knowing what scopes individual symbols belong to, or their type, and many other details such as data flow. Full disclosure: I build one of these.

Generating random but still valid expressions based on yacc/bison/ANTLR grammar

Is it possible? Any tool available for this?
You can do this with any system that gives you access to base grammar. ANTLR and YACC compile your grammar away so you don't have them anymore. In ANTLR's case, the grammar has been turned into code; you're not going to get it back. In YACC's case, you end up with parser tables, which contain the essence of the grammar; you could walk such parse tables if you understood them well enough to do what I describe below as.
It is easy enough to traverse a set of explicitly represented grammar rules and randomly choose expansions/derivations. By definition this will get you valid syntax.
What it won't do is get you valid code. The problem here is that most languages really have context sensitive syntax; most programs aren't valid unless the declared identifiers are used in a way consistent with their declaration and scoping rules. That latter requires a full semantic check.
Our DMS Software Reengineering Toolkit is used to parse code in arbitrary languages [using a grammar], build ASTs, lets you analyze and transform those trees, and finally prettyprint valid (syntactic) text. DMS provides direct access to the grammar rules, and tree building facilities, so it is pretty easy to generate random syntactic trees (and prettyprint). Making sure they are semantically valid is hard with DMS too; however, many of DMS's front ends can take a (random) tree and do semantic checking, so at least you'd know if the tree was semantically valid.
What you do if it says "no" is still an issue. Perhaps you can generate identifier names in way that guarantees at least not-inconsistent usage, but I suspect that would be langauge-dependent.
yacc and bison turn your grammar into a finite state machine. You should be able to traverse the state machine randomly to find valid inputs.
Basically, at each state you can either shift a new token on to the stack and move to a new state or reduce the top token in the stack based on a set of valid reductions. (See the Bison manual for details about how this works).
Your random generator will traverse the state machine making random but valid shifts or reductions at each state. Once you reach the terminal state you have a valid input.
For a human readable description of the states you can use the -v or --report=state option to bison.
I'm afraid I can't point you to any existing tools that can do this.

removing dead variables using antlr

I am currently servicing an old VBA (visual basic for applications) application. I've got a legacy tool which analyzes that application and prints out dead variables. As there are more than 2000 of them I do not want to do this by hand.
Therefore I had the idea to transform the separate codefiles which contain the dead variable according to the aforementioned tool to ASTs and remove them that way.
My question: Is there a recommended way to do this?
I do not want to use StringTemplate, as I would need to create templates for all rules and if I had a commend on the hidden channel, it would be lost, right?
Everything I need is to remove parts of that code and print out the rest as it was read in.
Any one has any recommendations, please?
Some theory
I assume that regular expressions are not enough to solve your task. That is you can't define the notion of a dead-code section in any regular language and expect to express it in a context-free language described by some antlr grammar.
The algorithm
The following algorithm can be suggested:
Tokenize source code with a lexer.
Since you want to preserve all the correct code -- don't skip or hide it's tokens. Make sure to define separate tokens for parts which may be removed or which will be used to determine the dead code, all other characters can be collected under a single token type. Here you can use output of your auxiliary tool in predicates to reduce the number of tokens generated. I guess antlr's tokenization (like any other tokenization) is expressed in a regular language so you can't remove all the dead code on this step.
Construct AST with a parser.
Here all the powers of a context-free language can be applied -- define dead-code sections in parser's rules and remove it from the AST being constructed.
Convert AST to source code. You can use some tree parser here, but I guess there is an easier way which can be found observing toString and similar methods of a tree type returned by the parser.