Improve syntactic predicate error messages? - antlr

I am trying to improve the error messages ANTLR gives, and I noticed that syntactic predicates seem to be the root of the bad error messages.
This is the one I am currently working on. Here is an example of the grammar's structure. Sorry that I cannot provide the actual grammar. Hopefully this illustrates the point though.
defs
: (a) => a | b
;
a
: A B c
;
//b is actually much further down the chain and due to ordering can't be moved up.
b
: A c
;
The issue is that if, for example, you have the tokens "A B D", the error message you get comes from the 'b' rule. I want the error message to come from the 'a' rule. In other words, once "A B" is matched, I want an error if 'c' isn't matched.
I thought maybe you could do this
a
: (A B) => A B c | {EmitErrorMessage("error");}
;

You should relax the syntactic predicate in defs instead of adding one to a.
defs
: (A B) => a
| b
;
This will cause the parser to choose the first alternative and enter the a rule based on just the two symbols A B.
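To see why the shorter predicate changes which rule reports the error, here is a rough Python model of the two versions of defs. It is only a toy simulation, not ANTLR-generated code: rule c is assumed to be a single token "C", and all function names are made up for the illustration.

```python
# Toy model of the two predicate choices; not ANTLR-generated code.

def parse_c(tokens, pos, rule):
    """Match the token 'C' (standing in for rule c) or say which rule failed."""
    if pos < len(tokens) and tokens[pos] == "C":
        return pos + 1
    found = tokens[pos] if pos < len(tokens) else "<EOF>"
    raise SyntaxError(f"rule {rule}: expected c but found {found}")

def parse_a(tokens):
    # a : A B c ;
    if tokens[:2] != ["A", "B"]:
        raise SyntaxError("rule a: expected A B")
    return parse_c(tokens, 2, "a")

def parse_b(tokens):
    # b : A c ;
    if tokens[:1] != ["A"]:
        raise SyntaxError("rule b: expected A")
    return parse_c(tokens, 1, "b")

def defs_full_predicate(tokens):
    # defs : (a) => a | b ;   the predicate speculatively runs all of rule a
    try:
        parse_a(tokens)
    except SyntaxError:
        return parse_b(tokens)   # predicate failed, so fall through to b
    return parse_a(tokens)

def defs_relaxed_predicate(tokens):
    # defs : (A B) => a | b ;   the predicate only looks at two tokens
    if tokens[:2] == ["A", "B"]:
        return parse_a(tokens)   # committed to a, so a reports the error
    return parse_b(tokens)

def error_of(parser, tokens):
    """Return the error message a parser produces, or None on success."""
    try:
        parser(tokens)
    except SyntaxError as exc:
        return str(exc)
    return None
```

With the tokens A B D, error_of(defs_full_predicate, ...) gives a message from rule b, while error_of(defs_relaxed_predicate, ...) gives one from rule a.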

Related

ANTLR4 : clean grammar and tree with keywords (aliases ?)

I am looking for a solution to a simple problem.
The example :
SELECT date, date(date)
FROM date;
This is a rather stupid example where a table, its column, and a function all have the name "date".
The snippet of my grammar (very simplified) :
simple_select
: SELECT selected_element (',' selected_element) FROM from_element ';'
;
selected_element
: function
| REGULAR_WORD
;
function
: REGULAR_WORD '(' function_argument ')'
;
function_argument
: REGULAR_WORD
;
from_element
: REGULAR_WORD
;
DATE: D A T E;
FROM: F R O M;
SELECT: S E L E C T;
REGULAR_WORD
: (SIMPLE_LETTER) (SIMPLE_LETTER | '0'..'9')*
;
fragment SIMPLE_LETTER
: 'a'..'z'
| 'A'..'Z'
;
DATE is a keyword (it is used somewhere else in the grammar).
If I want it to be recognised by my grammar as a normal word, here are my solutions :
1) I add it everywhere I used REGULAR_WORD, next to it.
Example :
selected_element
: function
| REGULAR_WORD
| DATE
;
=> I don't want this solution. I don't have only "DATE" as a keyword, and I have many rules using REGULAR_WORD, so I would need to add a list of many (50+) keywords like DATE to many (20+) parser rules : it would be absolutely ugly.
PROS: make a clean tree
CONS: make a dirty grammar
2) I use a parser rule in between to get all those keywords, and then, I replace every occurrence of REGULAR_WORD by that parser rule.
Example :
word
: REGULAR_WORD
| DATE
;
selected_element
: function
| word
;
=> I do not want this solution either, as it adds one more parser rule to the tree and pollutes the information (I do not want to know that "date" is a word; I want to know that it's a selected_element, a function, a function_argument or a from_element).
PROS: make a clean grammar
CONS: make a dirty tree
Either way, I have a dirty tree or a dirty grammar. Is there a way to keep both clean?
I looked for aliases or an equivalent of fragments for parser rules, but ANTLR4 doesn't seem to have any.
Thank you, have a nice day !
There are four different grammars for SQL dialects in the Antlr4 grammar repository and all four of them use your second strategy. So it seems like there is a consensus among Antlr4 sql grammar writers. I don't believe there is a better solution given the design of the Antlr4 lexer.
As you say, that leads to a bit of noise in the full parse tree, but the relevant non-terminal (function, selected_element, etc.) is certainly present and it does not seem to me to be very difficult to collapse the unit productions out of the parse tree.
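As a rough idea of what collapsing the unit productions could look like, here is a small Python sketch. It does not use the ANTLR runtime: the tree is assumed to be plain nested tuples of the form (rule_name, [children]) with token text as strings, and the helper names are hypothetical.

```python
# Parse tree nodes are (rule_name, [children]); leaves are plain token strings.
# This representation is an assumption for the sketch, not the ANTLR runtime API.

def _expand(node, noise):
    """Return the list of nodes that should stand in for `node`."""
    if isinstance(node, str):
        return [node]
    rule, children = node
    kept = [n for child in children for n in _expand(child, noise)]
    # A helper rule disappears and hands its children to its parent.
    return kept if rule in noise else [(rule, kept)]

def collapse(tree, noise=frozenset({"word"})):
    """Remove every node whose rule is in `noise` (assumes the root is not one)."""
    (root,) = _expand(tree, noise)
    return root

# Tree for "SELECT date, date(date) FROM date;" under the second strategy.
tree = ("simple_select", [
    "SELECT",
    ("selected_element", [("word", ["date"])]),
    ",",
    ("selected_element", [
        ("function", [("word", ["date"]), "(",
                      ("function_argument", [("word", ["date"])]), ")"]),
    ]),
    "FROM",
    ("from_element", [("word", ["date"])]),
    ";",
])
clean = collapse(tree)
```

After the call, clean still contains selected_element, function, function_argument and from_element, but no word nodes.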
As I understand it, when Antlr4 was being designed, a decision was made to only automatically produce full parse trees, because the design of condensed ("abstract") syntax trees is too idiosyncratic to fit into a grammar DSL. So if you find an AST more convenient, you have the responsibility to generate one yourself. That's generally straightforward, although it involves a lot of boilerplate.
Other parser generators do have mechanisms which can handle "semireserved keywords". In particular, the Lemon parser generator, which is part of the Sqlite project, includes a %fallback declaration which allows you to specify that one or more tokens should be automatically reclassified in a context in which no grammar rule allows them to be used. Unfortunately, Lemon does not generate Java parsers.
Another similar option would be to use a parser generator which supports "scannerless" parsing. Such parsers typically use algorithms like Earley/GLL/GLR, capable of parsing arbitrary CFGs, to get around the need for more lookahead than can conveniently be supported in fixed-lookahead algorithms such as LALR(1).
This is the so-called keywords-as-identifiers problem, and it has been discussed many times before. For instance, I asked a similar question 6 years ago on the ANTLR mailing list. There are also questions touching this area here on Stack Overflow, for instance "Trying to use keywords as identifiers in ANTLR4; not working".
Terence Parr wrote a wiki article for ANTLR3 in 2008 that briefly describes 2 possible solutions:
This grammar allows "if if call call;" and "call if;".
grammar Pred;
prog: stat+ ;
stat: keyIF expr stat
| keyCALL ID ';'
| ';'
;
expr: ID
;
keyIF : {input.LT(1).getText().equals("if")}? ID ;
keyCALL : {input.LT(1).getText().equals("call")}? ID ;
ID : 'a'..'z'+ ;
WS : (' '|'\n')+ {$channel=HIDDEN;} ;
You can make those semantic predicates more efficient by intern'ing those strings so that you can do integer comparisons instead of string compares.
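For readers who have not used semantic predicates, the following Python sketch mimics what the Pred grammar does: the lexer only ever produces ID and ';' tokens, and the parser decides from the token text whether an ID acts as a keyword. It is a hand-written recursive descent parser for illustration only, with made-up class and method names, not what ANTLR generates.

```python
import re

def tokenize(text):
    # Every word is an ID; ';' is its own token; anything else is skipped.
    tokens = []
    for word, _semi in re.findall(r"([a-z]+)|(;)", text):
        tokens.append(("ID", word) if word else (";", ";"))
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def match(self, kind, text=None):
        tok = self.peek()
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError(f"expected {text or kind}, found {tok[1] or tok[0]}")
        self.pos += 1
        return tok[1]

    def is_key(self, word):
        # Counterpart of the predicate {input.LT(1).getText().equals(word)}?
        return self.peek() == ("ID", word)

    def stat(self):
        if self.is_key("if"):            # keyIF expr stat
            self.match("ID", "if")
            cond = self.match("ID")      # expr : ID, so "if" is fine here
            return ("if", cond, self.stat())
        if self.is_key("call"):          # keyCALL ID ';'
            self.match("ID", "call")
            target = self.match("ID")
            self.match(";")
            return ("call", target)
        self.match(";")                  # empty statement
        return ("empty",)

    def prog(self):
        stats = [self.stat()]
        while self.peek()[0] != "EOF":
            stats.append(self.stat())
        return stats
```

Parser("if if call call;").prog() and Parser("call if;").prog() both succeed, because "if" and "call" are only treated as keywords at the start of a statement.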
The other alternative is to do something like this
identifier : KEY1 | KEY2 | ... | ID ;
which is a set comparison and should be faster.
Normally, as @rici already mentioned, people prefer the solution where you keep all keywords in a separate rule and add that rule to your normal identifier rule (wherever such a keyword is allowed).
The other solution in the wiki can be generalized to any keyword by using a lookup table/list in an action in the ID lexer rule, which checks whether a given string is a keyword. This solution is not only slower but also sacrifices clarity in your parser grammar, since you can no longer use keyword tokens in your parser rules.

ANTLR4 : ordering problem of parser rules for a keyword used in several rules (AND, BETWEEN AND)

I am having a problem while parsing some SQL typed string with ANTLR4.
The parsed string is :
WHERE a <> 17106
AND b BETWEEN c AND d
AND e BTW(f, g)
Here is a snippet of my grammar :
where_clause
: WHERE element
;
element
: element NOT_EQUAL_INFERIOR element
| element BETWEEN element AND element
| element BTW LEFT_PARENTHESIS element COMMA_CHAR element RIGHT_PARENTHESIS
| element AND element
| WORD
;
NOT_EQUAL_INFERIOR: '<>';
LEFT_PARENTHESIS: '(';
RIGHT_PARENTHESIS: ')';
COMMA_CHAR: ',';
BETWEEN: B E T W E E N;
BTW: B T W;
WORD ... //can be anything ... it doesn't matter for the problem.
[Image: the resulting parse tree (source: hostpic.xyz)]
But as you can see on that same picture, the tree is not the "correct one".
ANTLR4 is greedy: it absorbs everything that follows BETWEEN into a single "element", whereas we want it to take only "c" and "d".
Naturally, since everything ends up inside that element, the parser is left without the AND that BETWEEN itself requires, so it fails.
I have tried changing the order of the alternatives (putting AND before BETWEEN) and making those alternatives right-associative (<assoc=right>), but neither worked. They change the tree, but not into the shape I want.
I feel like the problem is a mix of greediness, associativity and recursion, which makes it hard to search for similar issues, but maybe I'm just missing the right words.
Thanks, have a nice day !
I think you misuse the rule element. I don't think SQL allows you to put anything as left and right limits of BETWEEN.
Not tested, but I'd try this:
expression
: expression NOT_EQUAL_INFERIOR expression
| term BETWEEN term AND term
| term BTW LEFT_PARENTHESIS term COMMA_CHAR term RIGHT_PARENTHESIS
| expression AND expression
| term
;
term
: WORD
;
Here your element becomes expression in most places, but in others it becomes term. The latter is a dummy rule for now, but I'm pretty sure you'd want to also add e.g. literals to it.
Disclaimer: I don't actually use ANTLR (I use my own), and I haven't worked with the (rather hairy) SQL grammar in a while, so this may be off the mark, but I think to get what you want you'll have to do something along the lines of:
...
where_clause
: WHERE disjunction
;
disjunction
: conjunction OR disjunction
| conjunction
;
conjunction
: element AND conjunction
| element
;
element
: element NOT_EQUAL_INFERIOR element
| element BETWEEN element AND element
| element BTW LEFT_PARENTHESIS element COMMA_CHAR element RIGHT_PARENTHESIS
| WORD
;
...
This is not the complete refactoring needed but illustrates the first steps.
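To show the layering idea in action, here is a small hand-written Python parser, offered as a sketch rather than a translation of either grammar above. It assumes operands are single words or numbers, leaves out the OR layer, and all function names are invented. The key point is that the BETWEEN alternative consumes its own AND before the conjunction level gets a chance to see it.

```python
import re

KEYWORDS = {"AND", "BETWEEN", "BTW"}

def tokenize(text):
    return re.findall(r"<>|[(),]|\w+", text)

def parse_condition(text):
    """Parse the condition that follows WHERE into nested tuples."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos].upper() if pos < len(tokens) else None

    def take(expected):
        nonlocal pos
        if peek() != expected:
            found = tokens[pos] if pos < len(tokens) else "<EOF>"
            raise SyntaxError(f"expected {expected}, found {found}")
        pos += 1

    def operand():
        nonlocal pos
        tok = peek()
        if tok is None or tok in KEYWORDS or not re.fullmatch(r"\w+", tok):
            raise SyntaxError(f"expected an operand, found {tok or '<EOF>'}")
        pos += 1
        return tokens[pos - 1]

    def predicate():
        left = operand()
        if peek() == "<>":
            take("<>")
            return ("<>", left, operand())
        if peek() == "BETWEEN":
            take("BETWEEN")
            low = operand()
            take("AND")   # this AND belongs to BETWEEN, not to the conjunction
            return ("BETWEEN", left, low, operand())
        if peek() == "BTW":
            take("BTW")
            take("(")
            low = operand()
            take(",")
            high = operand()
            take(")")
            return ("BTW", left, low, high)
        return left

    def conjunction():
        left = predicate()
        if peek() == "AND":
            take("AND")
            return ("AND", left, conjunction())
        return left

    tree = conjunction()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]}")
    return tree
```

Calling parse_condition("a <> 17106 AND b BETWEEN c AND d AND e BTW(f, g)") yields three predicates joined by AND, with "c" and "d" as the only bounds of the BETWEEN.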

Esper - pattern detection

I have a question for the community regarding pattern detection with Esper.
Suppose you want to detect the following pattern among a collection of data : A B C
However, it is possible, that in the actual data, you might have: A,B,D,E,C. My goal is to design a rule that could still detect A B C by keeping A B in memory, and fire the alert as soon as it sees C.
Is it possible to do this? With the standard select * from pattern(a = event -> b= event -> c=event), it only outputs when the three are in sequence in the data, but not when there is other, useless data between them.
With the standard "select * from pattern [a=A -> b=B]" there can be any events between A and B. Your statement is therefore wrong. I think you are confused about how to remove useless data. Use a filter such as "a=event(...not useless...) -> b=event(...not useless...)". Within the parens place the filter expressions that distinguish between useless and not useless events, i.e. "a=event(amount>10)" or whatever.
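The followed-by behavior described here can be modeled in a few lines of Python. This is only a toy model of the semantics, not Esper code: each step is a filter function, events that do not satisfy the step currently awaited are simply skipped, and the names followed_by and is_type are invented for the example.

```python
def followed_by(events, filters):
    """Return the indices of the first events satisfying `filters` in order.

    Events that do not match the filter currently awaited are ignored,
    which is how A -> B -> C can still fire on A, B, D, E, C.
    Returns None when the sequence never completes.
    """
    indices = []
    step = 0
    for i, event in enumerate(events):
        if step < len(filters) and filters[step](event):
            indices.append(i)
            step += 1
    return indices if step == len(filters) else None

def is_type(name):
    # Stand-in for a filter expression such as event(type = name).
    return lambda event: event == name

pattern = [is_type("A"), is_type("B"), is_type("C")]
match = followed_by(["A", "B", "D", "E", "C"], pattern)
```

Here match holds the positions of A, B and C, with D and E skipped; a stream that never delivers C gives None.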

antlr add syntactic predicate

For the following rule :
switchBlockLabels
: ^(SWITCH_BLOCK_LABEL_LIST switchCaseLabel* switchDefaultLabel? switchCaseLabel*)
;
I got an error: "rule switchBlockLabels has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2". I tried to add a syntactic predicate to solve this problem, and I read the book "The Definitive ANTLR Reference". Now I am confused: since there are no alternatives in rule switchBlockLabels, no decision needs to be made about which one to choose.
Can anyone help me?
Whenever the tree parser stumbles upon, say, 2 switchCaseLabels (and no switchDefaultLabel in the middle), it does not know which switchCaseLabel* they belong to. There are 3 possibilities the parser can choose from:
2 switchCaseLabels are matched by the 1st switchCaseLabel*;
2 switchCaseLabels are matched by the 2nd switchCaseLabel*;
1 switchCaseLabel is matched by the 1st switchCaseLabel*, and one by the 2nd switchCaseLabel*.
and since the parser does not like to choose for you, it emits an error.
You need to do something like this instead:
switchBlockLabels
: ^(SWITCH_BLOCK_LABEL_LIST switchCaseLabel* (switchDefaultLabel switchCaseLabel*)?)
;
That way, when there are only switchCaseLabels, and no switchDefaultLabel, these switchCaseLabels would be always matched by the first switchCaseLabel*: there is no ambiguity anymore.
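The difference between the two rule shapes can be checked by brute force. The Python sketch below counts how many ways a list of labels can be matched by each shape. It only illustrates the ambiguity and is not how ANTLR's static analysis works; labels are assumed to be the strings "case" and "default", and the function names are invented.

```python
def count_original(labels):
    """Ways to match labels against  case* default? case*  ."""
    n, count = len(labels), 0
    for i in range(n + 1):                 # first case* consumes labels[:i]
        if i > 0 and labels[i - 1] != "case":
            break
        for has_default in (False, True):
            if has_default and (i == n or labels[i] != "default"):
                continue
            rest = i + 1 if has_default else i
            # the second case* must consume everything that is left
            if all(label == "case" for label in labels[rest:]):
                count += 1
    return count

def count_refactored(labels):
    """Ways to match labels against  case* (default case*)?  ."""
    n, count = len(labels), 0
    for i in range(n + 1):                 # first case* consumes labels[:i]
        if i > 0 and labels[i - 1] != "case":
            break
        if i == n:
            count += 1                     # optional group is absent
        elif labels[i] == "default" and all(
            label == "case" for label in labels[i + 1:]
        ):
            count += 1                     # optional group takes the rest
    return count
```

For two case labels, the original shape has three derivations (the three possibilities listed above) and the refactored shape has exactly one.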

Writing an ANTLR action "in between" multiplicity

I'm working on an ANTLR grammar that looks like...
A : B+;
...and I'd like to be able to perform an action before and after each instance of B. For example, I'd like something like...
A : A {out("Before");} B {out("After");}
| {out("Before");} B {out("After");};
So that on the input stream A B B I would see the output...
Before
After
Before
After
Of course the second example isn't valid ANTLR syntax because of the left recursive rule. Is there a way to accomplish what I want with proper ANTLR syntax?
I should also mention that there are other ways of reaching the B rule so simply surrounding the B rule with before and after won't work.
Doesn't something like
A : ({out("Before");} B {out("After");})+;
work?