Can ANTLR return Lines of Code when lexing? - antlr

I am trying to use ANTLR to analyse a large set of code using the full Java grammar. Since ANTLR needs to open all the source files and scan them anyway, I am wondering if it can also return the lines of code (LoC).
I checked the APIs for Lexer and Parser, and it seems they do not return LoC. Is it easy to instrument the grammar rules a bit to get LoC? The full Java grammar is complicated; I don't really want to mess with a large part of it.

If you have an existing ANTLR grammar, and want to count certain things during parsing, you could do something like this:
grammar ExistingGrammar;
// ...
@parser::members {
public int loc = 0;
}
// ...
someParserRule
: SomeLexerRule someOtherParserRule {loc++;}
;
// ...
So, whenever your parser encounters a someParserRule, you increase loc by one by placing {loc++;} after (or before) the contents of the rule.
So, whatever your definition of a line of code is, simply place {loc++;} in the appropriate rule to increase the counter. Be careful not to increase it twice:
statement
: someParserRule {loc++;}
| // ...
;
someParserRule
: SomeLexerRule someOtherParserRule {loc++;}
;
EDIT
I just noticed that in the title of your question you asked if this can be done during lexing. That won't be possible. Let's say a LoC always ends with a ';'. During lexing, you wouldn't be able to distinguish between a ';' after, say, an assignment (which is a single LoC), and the two ';'s inside a for(int i = 0; i < n; i++) { ... } statement (which wouldn't be two LoC).

In the C target the data structure ANTLR3_INPUT_STREAM has a getLine() function which returns the current line from the input stream. It seems the Java version of this is CharStream.getLine(). You should be able to call this at any time and get the current line in the input stream.

Use a visitor to visit the CompilationUnit context, then context.stop.getLine() will give you the last line number of the compilation unit context.
@Override
public Integer visitCompilationUnit(@NotNull JAVAParser.CompilationUnitContext ctx) {
    return ctx.stop.getLine();
}

Related

Dropping a token in Yacc/Bison when an error is matched in the rhs production

I am writing a simple calculator in which an expression can be reduced to a statement, or list of statements. If a bad expression triggers a syntax error, I try to catch it with a production rule and then ignore it by giving the rule no actions. However, I believe this still reduces to a stmt_list despite it not being a valid statement.
Is there a way to make it simply ignore tokens matched with an error and prevent them from being reduced and used later?
The code below matches error ';' and reduces it to a stmt_list. It will then try to reduce that stmt_list with a valid expression, but since the first production was never called this will trigger a memory exception. My objective is to have Bison literally do nothing if an error is matched, such that a later valid expression can be the first reduction to stmt_list.
stmt_list:
    expr ';' {
        // Allocate memory for statement array
        stmts = (float*)malloc(24 * sizeof(float));
        // Assign pointer for array
        $$ = stmts;
        // At pointer, assign statement value
        *$$ = $1;
        // Increment pointer (for next job)
        $$++;
    }
  | stmt_list expr ';' {
        $$ = $1;
        *$$ = $2;
        $$++;
    }
  | error ';' { } // Do nothing (ignore bad stmt)
  | stmt_list error ';' { } // Do nothing (ignore bad stmt)
  ;
If you supply no action for a rule, bison/yacc provides the default action $$ = $1.
In fact, you are not providing no action. You are providing an explicit action which does nothing. As it happens, if you use the C template, the parser will still perform the default action. In other templates, an action which does not assign a value to $$ might provoke a warning during parser generation. But it certainly won't modify your data structures so as to nullify the action; it can't know what that would mean. If you know, you should write it as the action :-).
It's not 100% clear to me why you are keeping the results of the evaluations in a fixed-size dynamically-allocated array. You make no attempt to detect when the array fills up, so it's entirely possible that you'll end up overflowing the allocation and overwriting random memory. Moreover, using a global like this isn't usually a good idea because it prevents you from building more than one list at the same time. (For example, if you wanted to implement function calls, since a function's arguments are also a list of expressions.)
On the whole, it's better to put the implementation of the expanding expression list in a simple API which is implemented elsewhere. Here, I'm going to assume that you've done that; for specificity, I'll assume the following API (although it's just one example):
/* The list header structure, which contains all the information
* necessary to use the list. The forward declaration makes it
* possible to use pointers to ExprList objects without having to
* expose its implementation details.
*/
typedef struct ExprList ExprList;
/* Creates a new empty expression-list and returns a pointer to its header. */
ExprList* expr_list_create(void);
/* Resizes the expression list to the supplied size. If the list
* currently has fewer elements, new elements with default values are
* added at the end. If it currently has more elements, the excess
* ones are discarded. Calling with size 0 empties the list (but
* doesn't delete it).
*/
int expr_list_resize(ExprList* list, int new_length);
/* Frees all storage associated with the expression list. The
* argument must have been created with expr_list_create, and its
* value must not be used again after this function returns.
*/
void expr_list_free(ExprList* list);
/* Adds one element to the end of the expression-list.
* I kept the float datatype for expression values, although I
 * strongly believe that it's not ideal. But an advantage of using an
* API like this is that it is easier to change.
*/
void expr_list_push(ExprList* list, float value);
/* Returns the number of elements in the expression-list. */
int expr_list_len(ExprList* list);
/* Returns the address of the element in the expression list
* with the given index. If the index is out of range, behaviour
* is undefined; a debugging implementation will report an error.
*/
float* expr_list_at(ExprList* list, int index);
With that API, we can rewrite the productions for valid expressions:
stmt_list: expr ';' { $$ = expr_list_create();
expr_list_push($$, $1);
}
| stmt_list expr ';' { $$ = $1;
expr_list_push($$, $2);
}
Now for the error cases. You have two error rules: one triggers when the error is at the beginning of a list, and the other when the error is encountered after one or more (possibly erroneous) expressions have been handled. Both of these are productions for stmt_list, so they must produce the same value type as stmt_list does (ExprList*). Beyond that, they can do whatever you think is appropriate when a syntax error occurs.
The first one, when the error is at the start of the list, only needs to create an empty list. It's hard to see what else it could do.
stmt_list: error ';' { $$ = expr_list_create(); }
It seems to me that there are at least two alternatives for the other error action, when an error is detected after the list has at least one successfully-computed value. One possibility is to ditch the erroneous item, leaving the rest of the list intact. This requires only the default action:
stmt_list: stmt_list error ';'
(Of course, you could add the action { $$ = $1; } explicitly, if you wanted to.)
The other possibility is to empty the entire list, so as to start from scratch with the next element:
stmt_list: stmt_list error ';' { $$ = $1;
expr_list_resize($$, 0);
}
There are undoubtedly other possibilities. As I said, bison cannot figure out what it is that you intended (and neither can I, really). You'll have to implement whatever behaviour you want.
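To make this concrete, here is one minimal way the ExprList API sketched above could be implemented: a plain growable array. The doubling growth strategy and the initial capacity of 8 are arbitrary choices for illustration, not something the API requires.

```c
#include <stdlib.h>

/* One possible implementation of the ExprList API declared above:
 * a growable array of floats.  Error checking of malloc/realloc is
 * omitted to keep the sketch short. */
struct ExprList {
    float *data;
    int len;   /* number of elements in use */
    int cap;   /* allocated capacity */
};
typedef struct ExprList ExprList;

ExprList *expr_list_create(void) {
    ExprList *list = malloc(sizeof *list);
    list->len = 0;
    list->cap = 8;                      /* arbitrary initial capacity */
    list->data = malloc(list->cap * sizeof *list->data);
    return list;
}

void expr_list_push(ExprList *list, float value) {
    if (list->len == list->cap) {       /* full: double the capacity */
        list->cap *= 2;
        list->data = realloc(list->data, list->cap * sizeof *list->data);
    }
    list->data[list->len++] = value;
}

int expr_list_resize(ExprList *list, int new_length) {
    if (new_length > list->cap) {
        list->cap = new_length;
        list->data = realloc(list->data, list->cap * sizeof *list->data);
    }
    for (int i = list->len; i < new_length; i++)
        list->data[i] = 0.0f;           /* default value for new slots */
    list->len = new_length;
    return new_length;
}

int expr_list_len(ExprList *list) { return list->len; }

float *expr_list_at(ExprList *list, int index) {
    return &list->data[index];          /* no bounds check in this sketch */
}

void expr_list_free(ExprList *list) {
    free(list->data);
    free(list);
}
```

With something like this in place, the grammar actions shown above (expr_list_create, expr_list_push, expr_list_resize) work unchanged, and the fixed-size overflow problem of the original malloc(24 * sizeof(float)) disappears.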

reading signs in an equation

Using this https://github.com/antlr/grammars-v4/tree/master/cpp ANTLR grammar, I'm trying to parse C++ code. Below is the visitor class I'm using; I don't have many visitor functions implemented:
#include <fstream>   // std::ifstream used in main
#include <iostream>
#include <antlr4-runtime.h>
#include "parser/CPP14Lexer.h"
#include "parser/CPP14BaseVisitor.h"
#include "parser/CPP14Parser.h"
#include "parser/CPP14Visitor.h"
class TREEVisitor : public CPP14BaseVisitor {
public:
virtual antlrcpp::Any visitAdditiveExpression(
CPP14Parser::AdditiveExpressionContext *ctx) override
{
std::cout << "AddExpr : " << ctx->getText() << std::endl;
std::vector<CPP14Parser::MultiplicativeExpressionContext *> mulpExprCtx =
ctx->multiplicativeExpression();
for (CPP14Parser::MultiplicativeExpressionContext *mulpExprLp : mulpExprCtx)
{
std::vector<CPP14Parser::PointerMemberExpressionContext *> ptrMbrExprCtx =
mulpExprLp->pointerMemberExpression();
// ptrMbrExprCtx->pointerMemberExpression()->castExpression()->unaryExpression();
// Different parts of an expression
for (CPP14Parser::PointerMemberExpressionContext *ptrMbrExprLp : ptrMbrExprCtx)
{
std::cout << "=> " << ptrMbrExprLp->getText() << std::endl;
}
}
return visitChildren(ctx);
}
};
int main(int argc, char *argv[]) {
std::ifstream stream;
stream.open(argv[1]);
antlr4::ANTLRInputStream input(stream);
CPP14Lexer lexer(&input);
antlr4::CommonTokenStream tokens(&lexer);
CPP14Parser parser(&tokens);
antlr4::tree::ParseTree *tree = parser.translationunit();
// Visitor
TREEVisitor visitor;
visitor.visit(tree);
return 0;
}
I'm trying to parse the following C++ code,
int ii = a + b - getLength() * 10 / 1;
What I'm trying to achieve here is to get all of the variables that are used to initialize the variable ii, together with their signs. Something like below, where I can relate each sign to the values/variables (for example, to know that the + comes after a):
a
+
b
-
getLength()
*
10
/
1;
So far I can only get output as follows,
AddExpr : a+b-c*10/1
=> a
=> b
=> getLength()
=> 10
=> 1
I don't seem to be able to get the signs between each operation.
The closest things to the signs in that equation I could find were the Star and Mod accessors:
tree::TerminalNode* startTn = mulpExprLp->Star();
So I tried to change the grammar file to get the other signs as well. While that gave me the signs in the equation, again I wasn't able to work out the position of each sign:
multiplicativeExpression:
pointerMemberExpression (
(Star | Div | Mod | Plus | Minus) pointerMemberExpression
)*;
I hope I could describe the problem clearly. I basically want to read each part of an equation and know the position of each sign.
Thanks,
Alex
It looks like you need a better understanding of the structure of your parse tree.
I would suggest going back to the original grammar (there are many problems with your multiplicativeExpression, mostly around it not building a proper parse tree).
Viewing the graphical version of your parse tree should be quite useful. This page gives a brief intro to setting up a grun alias to use TestRig. It’s usually a good idea to “play around” a bit with grun and various input to gain a better understanding of what ANTLR produces (token streams, parse trees, etc.) for your grammar.
Take a look at the documentation for how to run the TestRig utility with the -gui command line option. This will give you a graphical representation of your parse tree. Your immediate issue is that, since you only have a visitor for additiveExpression, it won't include the subtree for the multiplicativeExpression that holds the structure for multiplication and division.
Also, since you’re not finding the operations you need to take a closer look at the cpp14parser::AdditiveExpressionContext generated for your additiveExpression. The operator(s) should be available at one of the indices of your children nodes (the rule is written to allow multiple addition/subtraction in a single context, so they’ll probably be available in some list/array structure (sorry, not intimately familiar with what ANTLR generates for C++)
BTW, you may find that, for your purposes, a listener is easier to use than a visitor. With listeners, a ParseTreeWalker takes care of walking the tree and calling back into your code as nodes are encountered. With visitors, it's up to you to navigate the parse tree (they can be useful when you need more flexibility, and make it a bit easier to return a value from visiting a node, but I find listeners much simpler for most use cases).

How to synthesise compiler testing data?

I am writing a simple compiler as schoolwork. I am looking for an automated approach to generate both positive and negative testing data to test my compiler, given the formal grammar and other specifications. The language I am dealing with is of moderate size, with 38 or so non-terminals. For the sake of illustration, here is a snapshot of the grammar:
program: const_decl* declaration* ENDMARKER
# statement
stmt: flow_stmt | '{' stmt* '}' | NAME [stmt_trailer] ';' | ';'
stmt_trailer: arglist | ['[' expr ']'] '=' expr
flow_stmt: if_stmt | for_stmt | while_stmt | read_stmt ';' | write_stmt ';' | return_stmt ';'
return_stmt: 'return' ['(' expr ')']
if_stmt: 'if' '(' condition ')' stmt ['else' stmt]
condition: expr ('<'|'<='|'>'|'>='|'!='|'==') expr | expr
for_stmt: ('for' '(' NAME '=' expr ';' condition ';'
NAME '=' NAME ('+'|'-') NUMBER ')' stmt)
Are there any tools to generate input files with the help of the grammar? The hand-written tests are too tedious or too weak to discover problems. An example of this language here:
void main() {
int N;
int temp;
int i, j;
int array_size;
reset_heap;
scanf(N);
for (i = 0; i < N; i = i + 1) {
scanf(array_size);
if (array_size > max_heap_size) {
printf("array_size exceeds max_heap_size");
} else {
for (j = 0; j < array_size; j = j + 1) {
scanf(temp);
heap[j] = temp;
}
heap_sort(array_size);
print_heap(array_size);
}
}
}
Generating controllable testing data automatically can save days. Given the simplicity of the language, there must be some way to do this effectively. Any pointers and insights are greatly appreciated.
Any pointer and insight is greatly appreciated.
This should have the subtopic of How to avoid combinatorial explosion when generating test data.
While I would not be surprised if there are tools to do this, having had the same need to generate test data for grammars, I have created a few one-off applications myself.
One of the best series of articles I have found on this is by Eric Lippert, Every Binary Tree There Is: think of the BNF converted to binary operators, then converted to an AST as you read the tree. However, he uses Catalan trees (every branch has two leaves), whereas when I wrote my app I preferred Motzkin trees (a branch can have one or two leaves).
Also, he did his in C# with LINQ, and I did mine in Prolog using DCGs.
Generating the data based on the BNF or DCG is not hard; the real trick is to limit the area of expansion and the size of the expansion, and to inject bad data.
By area of expansion: let's say you want to test nested if statements three levels deep, but you have to have valid code that compiles. Obviously you need the boilerplate code to make it compile, and then you start changing the deeply nested if by adding or removing the else clause. So you need to put in constraints so that the boilerplate code is constant and the testing part is variable.
By size of expansion: let's say you want to test conditional expressions. You can easily calculate that if you have many operators, and you want to test them all in combination, you soon run into combinatorial explosion. The trick is to ensure you test deep enough and with enough breadth, but not every combination. Again, the judicious use of constraints helps.
So the point of all of this is that you start with a tool that takes in the BNF and generates valid code. Then you modify the BNF to add constraints and modify the generator to understand the constraints to generate the code examples.
Then you modify the BNF for invalid data and likewise the generator to understand those rules.
After that is working you can then start layering on levels of automation.
If you do go this route and decide that you will have to learn Prolog, take a look at Mercury first. I have not done this with Mercury, but if I do it again Mercury is high on the list.
While my actual code is not public, this and this is the closest to it that is public.
Along the way I had some fun with it in Code Golf.
When generating terminals such as reserved words or values for types, you can use predefined lists with both valid and invalid data, e.g. for the keyword if, if the language is case sensitive I would include if, If, IF, iF, etc. in the list. For value types such as unsigned byte I would include -1, 0, 255 and 256.
When I was testing basic binary math expressions with +, -, * and ^, I generated all the tests with the five basic numbers -2, -1, 0, 1, and 2. I thought it would be useless since I already had hundreds of test cases, but since it only took a few minutes to generate all of the test cases and several hours to run them, to my surprise it found a pattern I did not cover. The point here is that, contrary to what most people say about having too many test cases, remember that it is only time on a computer; by changing a few constraints you can run the large number of tests.
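To make the depth-bounding idea concrete, here is a toy sketch in C (not the Prolog/DCG approach described above). It enumerates every sentence of an invented two-terminal, two-operator expression grammar up to a fixed nesting depth; the depth bound is the constraint that keeps the combinatorial explosion in check.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy grammar:  expr -> NUM | expr OP expr
 * with NUM in {0, 1} and OP in {+, *}.  gen() enumerates every
 * sentence up to the given nesting depth.  MAX_SENTENCES is sized
 * for depth <= 2 in this sketch. */

#define MAX_SENTENCES 2048

static char *dup_str(const char *s) {
    char *t = malloc(strlen(s) + 1);
    strcpy(t, s);
    return t;
}

/* Builds the string "a OP b" in fresh storage. */
static char *concat3(const char *a, char op, const char *b) {
    char *s = malloc(strlen(a) + strlen(b) + 2);
    sprintf(s, "%s%c%s", a, op, b);
    return s;
}

/* Fills out[] with all sentences up to `depth`; returns the count. */
static int gen(char *out[], int depth) {
    int n = 0;
    out[n++] = dup_str("0");          /* the terminals */
    out[n++] = dup_str("1");
    if (depth == 0)
        return n;
    char *sub[MAX_SENTENCES];
    int m = gen(sub, depth - 1);      /* all shallower sentences */
    const char ops[] = { '+', '*' };
    for (int i = 0; i < m; i++)
        for (int o = 0; o < 2; o++)
            for (int j = 0; j < m; j++)
                out[n++] = concat3(sub[i], ops[o], sub[j]);
    for (int i = 0; i < m; i++)
        free(sub[i]);
    return n;
}
```

With depth 1 this yields the 10 sentences 0, 1, 0+0, 0+1, 0*0, 0*1, 1+0, 1+1, 1*0 and 1*1; the count grows as c(d) = 2 + 2*c(d-1)^2, which is exactly the explosion that the constraints discussed above are there to tame.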

Is there a benefit/penalty in record modification?

In a functional program I have an API that provides functions on complex state implemented as a record:
let remove_number nr {counter ; numbers ; foo } = {counter ; numbers = IntSet.remove nr numbers ; foo}
let add_fresh {counter ; numbers ; foo } = { counter = counter + 1 ; numbers = IntSet.add counter numbers ; foo }
I know, I can use the simplified record modification syntax like this:
let remove_number nr state = { state with numbers = IntSet.remove nr state.numbers }
When the record type grows, the latter style is actually more readable. Hence, I will probably use it anyway. But out of curiosity, I wonder whether it also allows the compiler to detect possible memory reuse more easily (my application is written in a monadic style, so there will usually be only one record that is passed along; hence an optimizing compiler could remove all allocations but one and do in-place mutation instead). In my limited view, the with-syntax gives a good heuristic for places to apply such an optimization, but is that true?
Does OCaml even optimize (unneeded) record allocations?
Is the record modification syntax lowered before any optimizations apply?
And finally, is there any pattern recognition implemented in the OCaml compiler that tells it there is a "cheap" way to create one record expression by modifying a "dead" value in place (and what is that optimization usually called)?
The two versions of remove_number that you give are equivalent. The { expr with ... } notation doesn't modify a record. It creates a new record.
Record modification requires the field to be declared mutable, and looks like this (note that rec is a reserved word in OCaml, so the record parameter is named r here):
let remove_number nr r = r.numbers <- IntSet.remove nr r.numbers
I don't think OCaml does the sort of optimization you describe. The plan with OCaml is to generate code that's close to what you write.

antlr infinite eof loop

Please have a look at my grammar: https://bitbucket.org/rstoll/tsphp-parser/raw/cdb41531e86ec66416403eb9c29edaf60053e5df/src/main/antlr/TSPHP.g
Somehow ANTLR produces an infinite loop finding infinite EOF tokens for the following input:
class a{public function void a(}
Although only prog expects EOF, classBody somehow accepts it as well. Does anyone have an idea how I can fix that, i.e. what I have to change so that classBody does not accept EOF tokens?
Code from the generated class:
// D:\\TSPHP-parser\\src\\main\\antlr\\TSPHP.g:287:129: ( classBody )*
loop17:
do {
int alt17=2;
int LA17_0 = input.LA(1);
if ( (LA17_0==EOF||LA17_0==Abstract||LA17_0==Const||LA17_0==Final||LA17_0==Function||LA17_0==Private||(LA17_0 >= Protected && LA17_0 <= Public)||LA17_0==Static) ) {
alt17=1;
}
switch (alt17) {
case 1 :
// D:\\TSPHP-parser\\src\\main\\antlr\\TSPHP.g:287:129: classBody
{
pushFollow(FOLLOW_classBody_in_classDeclaration1603);
classBody38=classBody();
state._fsp--;
if (state.failed) return retval;
if ( state.backtracking==0 ) stream_classBody.add(classBody38.getTree());
}
break;
default :
break loop17;
}
} while (true);
The problem occurs when the token is EOF: the loop never exits, since EOF is considered a valid token, even though I have not specified it that way.
EDIT I do not get the error if I comment lines 342 and 347 out (the empty cases in the rules accessModifierWithoutPrivateOrPublic and accessModifierOrPublic, respectively).
EDIT 2 I could solve my problem. I rewrote the methodModifier rule (integrating all the possible modifiers into one rule). This way ANTLR does not believe that EOF is a valid token after /* empty */ in
accessModifierOrPublic
: accessModifier
| /* empty */ -> Public["public"]
;
This type of bug can occur in error handling for ANTLR 3. In ANTLR 4, the IntStream.consume() method was updated to require the following exception be thrown to preempt this problem.
Throws:
IllegalStateException - if an attempt is made to consume the end of the stream (i.e. if LA(1)==EOF before calling consume).
For ANTLR 3 grammars, you can at least prevent an infinite loop by using your own TokenStream implementation (probably easiest to extend CommonTokenStream) and throwing this exception if the condition listed above is violated. Note that you might need to allow this condition to be violated once (the reasons are complicated), so keep a count and throw the IllegalStateException only if the code tries to consume EOF more than 2 or 3 times. Remember this is just an effort to break the infinite loop, so you can be a little "fuzzy" on the actual check.