Runtime values for expression parameters while using ANTLR - antlr

In the UI of a web app, I am trying to display autocomplete suggestions for Excel functions. This is done with the antlr4ts TypeScript library. I have a grammar like the one below:
conditionalFunction : STRING relational_op STRING;
relational_op : EQ | LT | GT | NE | LTE | GTE | IS | IS_NOT | LIKE;
EQ : '=';
LT : '<';
GT : '>';
NE : '!=';
LTE : '<=';
GTE : '>=';
In the UI, the user selects a set of columns of interest (for example 'column1', 'column2', ...) before defining formulas. While providing IntelliSense for the expression, is it possible to offer auto-suggestions for the two parameters of conditionalFunction from the list of columns the user selected? How can this be done with antlr4ts?
Please note that the column names will not follow the naming convention in the example; they can be any alphanumeric string.

Code completion is not something that ANTLR does “out of the box”. It’s easy to think that the grammar should be enough information to provide code completion. Turns out, it’s a bit trickier than that.
However, given that you’re already using TypeScript, have a look at antlr4-c3: https://github.com/mike-lischke/antlr4-c3
You don’t just add it and get code completion, but it does a very good job of providing the response data structure you need to search symbol tables, etc. and provide code completion (of course, you’ll also need to work out how to integrate it with your particular editor).
antlr4-c3 will do a great deal of the hard work for you (and with ANTLR’s algorithms it’s not nearly as simple as it seems on the surface).
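As a rough sketch of the last step (turning "what the grammar allows at the caret" into suggestions), the following assumes a completion engine has already reported the token names that may come next. The CaretCandidates shape, the suggest function and the sample column names are hypothetical and are not the antlr4-c3 API; only the token names (STRING, EQ, LT, ...) come from the grammar in the question.

```typescript
// Hypothetical shape: names of the token types the grammar allows at the caret,
// as a completion engine might report them. Not a real library type.
interface CaretCandidates {
  allowedTokenNames: string[];
}

// Where the grammar expects a STRING operand, offer the user-selected columns;
// for operator tokens, offer their literal text.
function suggest(
  candidates: CaretCandidates,
  selectedColumns: string[],
  typedPrefix: string,
  literals: { [tokenName: string]: string },
): string[] {
  const out: string[] = [];
  const prefix = typedPrefix.toLowerCase();
  for (const name of candidates.allowedTokenNames) {
    if (name === "STRING") {
      for (const col of selectedColumns) {
        if (col.toLowerCase().indexOf(prefix) === 0) out.push(col);
      }
    } else if (Object.prototype.hasOwnProperty.call(literals, name)) {
      if (literals[name].indexOf(typedPrefix) === 0) out.push(literals[name]);
    }
  }
  return out;
}

const operatorLiterals = { EQ: "=", LT: "<", GT: ">", NE: "!=", LTE: "<=", GTE: ">=" };
const columns = ["revenue2023", "region", "Q4total"]; // made-up column names

// Caret at an operand position, user typed "re":
const atOperand = suggest({ allowedTokenNames: ["STRING"] }, columns, "re", operatorLiterals);
// Caret at the operator position, user typed "<":
const atOperator = suggest({ allowedTokenNames: ["EQ", "LT", "LTE", "NE"] }, columns, "<", operatorLiterals);
console.log(atOperand, atOperator);
```

The column list lives outside the grammar, so it can change at runtime without regenerating the parser.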

Related

ANTLR4 : clean grammar and tree with keywords (aliases ?)

I am looking for a solution to a simple problem.
The example :
SELECT date, date(date)
FROM date;
This is a rather stupid example where a table, its column, and a function all have the name "date".
The snippet of my grammar (very simplified) :
simple_select
: SELECT selected_element (',' selected_element) FROM from_element ';'
;
selected_element
: function
| REGULAR_WORD
;
function
: REGULAR_WORD '(' function_argument ')'
;
function_argument
: REGULAR_WORD
;
from_element
: REGULAR_WORD
;
DATE: D A T E;
FROM: F R O M;
SELECT: S E L E C T;
REGULAR_WORD
: (SIMPLE_LETTER) (SIMPLE_LETTER | '0'..'9')*
;
fragment SIMPLE_LETTER
: 'a'..'z'
| 'A'..'Z'
;
DATE is a keyword (it is used somewhere else in the grammar).
If I want it to be recognised by my grammar as a normal word, here are my solutions :
1) I add it everywhere I used REGULAR_WORD, next to it.
Example :
selected_element
: function
| REGULAR_WORD
| DATE
;
=> I don't want this solution. DATE is not my only keyword, and many rules use REGULAR_WORD, so I would need to add a list of many (50+) keywords like DATE to many (20+) parser rules: it would be absolutely ugly.
PROS: make a clean tree
CONS: make a dirty grammar
2) I use a parser rule in between to get all those keywords, and then, I replace every occurrence of REGULAR_WORD by that parser rule.
Example :
word
: REGULAR_WORD
| DATE
;
selected_element
: function
| word
;
=> I do not want this solution either, as it adds one more parser rule to the tree and pollutes the information (I do not want to know that "date" is a word, I want to know that it's a selected_element, a function, a function_argument or a from_element ...).
PROS: make a clean grammar
CONS: make a dirty tree
Either way, I have a dirty tree or a dirty grammar. Isn't there a way to keep both clean?
I looked for aliases or a parser-rule equivalent of fragments, but it doesn't seem like ANTLR4 has any?
Thank you, have a nice day !
There are four different grammars for SQL dialects in the Antlr4 grammar repository and all four of them use your second strategy. So it seems like there is a consensus among Antlr4 sql grammar writers. I don't believe there is a better solution given the design of the Antlr4 lexer.
As you say, that leads to a bit of noise in the full parse tree, but the relevant non-terminal (function, selected_element, etc.) is certainly present and it does not seem to me to be very difficult to collapse the unit productions out of the parse tree.
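A minimal sketch of collapsing such unit productions, assuming the parse tree has first been copied into a plain TreeNode structure (a hypothetical type for illustration, not ANTLR's ParseTree interface):

```typescript
// Hypothetical plain tree: rule or token name, optional token text, children.
interface TreeNode {
  rule: string;
  text?: string;
  children: TreeNode[];
}

// Replace every node whose rule is listed in `noise` and that has exactly
// one child by that child, recursively.
function collapse(node: TreeNode, noise: string[]): TreeNode {
  const children = node.children.map((c) => collapse(c, noise));
  if (noise.indexOf(node.rule) >= 0 && children.length === 1) {
    return children[0];
  }
  return { rule: node.rule, text: node.text, children };
}

// selected_element -> word -> DATE("date")
const tree: TreeNode = {
  rule: "selected_element",
  children: [{ rule: "word", children: [{ rule: "DATE", text: "date", children: [] }] }],
};
const collapsed = collapse(tree, ["word"]);
console.log(collapsed.children[0].rule); // the `word` layer is gone
```

The interesting non-terminal (selected_element) stays in place; only the intermediate word node is removed.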
As I understand it, when Antlr4 was being designed, a decision was made to only automatically produce full parse trees, because the design of condensed ("abstract") syntax trees is too idiosyncratic to fit into a grammar DSL. So if you find an AST more convenient, you have the responsibility to generate one yourself. That's generally straight-forward although it involves a lot of boilerplate.
Other parser generators do have mechanisms which can handle "semireserved keywords". In particular, the Lemon parser generator, which is part of the Sqlite project, includes a %fallback declaration which allows you to specify that one or more tokens should be automatically reclassified in a context in which no grammar rule allows them to be used. Unfortunately, Lemon does not generate Java parsers.
Another similar option would be to use a parser generator which supports "scannerless" parsing. Such parsers typically use algorithms like Earley/GLL/GLR, capable of parsing arbitrary CFGs, to get around the need for more lookahead than can conveniently be supported in fixed-lookahead algorithms such as LALR(1).
This is the so-called keywords-as-identifiers problem and has been discussed many times before. For instance, I asked a similar question 6 years ago on the ANTLR mailing list. There are also questions touching on this area here on Stack Overflow, for instance "Trying to use keywords as identifiers in ANTLR4; not working".
Terence Parr wrote a wiki article for ANTLR3 in 2008 that briefly describes 2 possible solutions:
This grammar allows "if if call call;" and "call if;".
grammar Pred;
prog: stat+ ;
stat: keyIF expr stat
| keyCALL ID ';'
| ';'
;
expr: ID
;
keyIF : {input.LT(1).getText().equals("if")}? ID ;
keyCALL : {input.LT(1).getText().equals("call")}? ID ;
ID : 'a'..'z'+ ;
WS : (' '|'\n')+ {$channel=HIDDEN;} ;
You can make those semantic predicates more efficient by intern'ing those strings so that you can do integer comparisons instead of string compares.
The other alternative is to do something like this
identifier : KEY1 | KEY2 | ... | ID ;
which is a set comparison and should be faster.
Normally, as @rici already mentioned, people prefer the solution where you keep all keywords in their own rule and add that to your normal identifier rule (where such a keyword is allowed).
The other solution in the wiki can be generalized to any keyword by using a lookup table/list in an action in the ID lexer rule, which checks whether a given string is a keyword. This solution is not only slower, but also sacrifices clarity in your parser grammar, since you can no longer use keyword tokens in your parser rules.
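To illustrate what the predicate-based Pred grammar above does, here is a small hand-written recognizer (a sketch, not ANTLR output) in which every word is lexed as a plain ID and "if"/"call" are recognized only by comparing the token text, as keyIF and keyCALL do. The function names are made up for this example.

```typescript
// Every lowercase word becomes an ID token; ';' is its own token.
// Whitespace and any other characters are skipped.
function tokenize(src: string): string[] {
  return src.match(/[a-z]+|;/g) || [];
}

function isId(tok: string | undefined): boolean {
  return tok !== undefined && /^[a-z]+$/.test(tok);
}

// stat: keyIF expr stat | keyCALL ID ';' | ';'
// Returns the index just past the statement, or -1 if nothing matches.
function parseStat(tokens: string[], pos: number): number {
  const tok = tokens[pos];
  if (tok === undefined) return -1;
  if (tok === ";") return pos + 1;
  if (tok === "if" && isId(tokens[pos + 1])) {
    return parseStat(tokens, pos + 2); // keyIF expr stat
  }
  if (tok === "call" && isId(tokens[pos + 1]) && tokens[pos + 2] === ";") {
    return pos + 3; // keyCALL ID ';'
  }
  return -1;
}

// prog: stat+
function accepts(src: string): boolean {
  const tokens = tokenize(src);
  if (tokens.length === 0) return false;
  let pos = 0;
  while (pos < tokens.length) {
    pos = parseStat(tokens, pos);
    if (pos < 0) return false;
  }
  return true;
}

console.log(accepts("if if call call;"), accepts("call if;"), accepts("if ;"));
```

Because "if" and "call" are keywords only by position, the same text can serve as an identifier wherever an ID is expected.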

ANTLR4 : ordering problem of parser rules for a keyword used in several rules (AND, BETWEEN AND)

I am having a problem while parsing some SQL typed string with ANTLR4.
The parsed string is :
WHERE a <> 17106
AND b BETWEEN c AND d
AND e BTW(f, g)
Here is a snippet of my grammar :
where_clause
: WHERE element
;
element
: element NOT_EQUAL_INFERIOR element
| element BETWEEN element AND element
| element BTW LEFT_PARENTHESIS element COMMA_CHAR element RIGHT_PARENTHESIS
| element AND element
| WORD
;
NOT_EQUAL_INFERIOR: '<>';
LEFT_PARENTHESIS: '(';
RIGHT_PARENTHESIS: ')';
COMMA_CHAR: ',';
BETWEEN: B E T W E E N;
BTW: B T W;
WORD ... //can be anything ... it doesn't matter for the problem.
[Image: parse tree produced for the string above (source: hostpic.xyz)]
But as you can see on that same picture, the tree is not the "correct one".
Because ANTLR4 is greedy, it swallows everything that follows the BETWEEN into a single "element", but we want it to take only "c" and "d".
Naturally, since it swallows everything into the element rule, it is then missing the second AND of the BETWEEN, so it fails.
I have tried changing the order of the rules (putting AND before BETWEEN) and making those rules right-associative (<assoc=right>), but neither worked. They change the tree but don't make it the way I want it to be.
I feel like the error is a mix of greediness, associativity, recursion ... That makes it quite difficult to search for the same kind of issue, but maybe I'm just missing the correct words.
Thanks, have a nice day !
I think you misuse the rule element. I don't think SQL allows you to put anything as left and right limits of BETWEEN.
Not tested, but I'd try this:
expression
: expression NOT_EQUAL_INFERIOR expression
| term BETWEEN term AND term
| term BTW LEFT_PARENTHESIS term COMMA_CHAR term RIGHT_PARENTHESIS
| expression AND expression
| term
;
term
: WORD
;
Here your element becomes expression in most places, but in others it becomes term. The latter is a dummy rule for now, but I'm pretty sure you'd want to also add e.g. literals to it.
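The effect of restricting the BETWEEN bounds to term can be seen in a small hand-written recursive-descent sketch (not generated by ANTLR). As a simplification, it also restricts the operands of <> to terms, and it treats any run of word characters as a WORD; parseWhere and WhereAst are names made up for this example.

```typescript
type WhereAst = string | { op: string; args: WhereAst[] };

function parseWhere(src: string): WhereAst {
  const tokens: string[] = src.match(/<>|[(),]|\w+/g) || [];
  let pos = 0;
  const peek = () => (pos < tokens.length ? tokens[pos].toUpperCase() : "");
  const expect = (t: string) => {
    if (peek() !== t) throw new Error("expected " + t + " at token " + pos);
    pos++;
  };

  // term: WORD
  function term(): WhereAst {
    if (pos >= tokens.length || !/^\w+$/.test(tokens[pos])) {
      throw new Error("expected WORD at token " + pos);
    }
    return tokens[pos++];
  }

  // predicate: term BETWEEN term AND term
  //          | term BTW '(' term ',' term ')'
  //          | term ('<>' term)?
  function predicate(): WhereAst {
    const left = term();
    if (peek() === "BETWEEN") {
      pos++;
      const low = term();
      expect("AND"); // this AND belongs to BETWEEN, not to the conjunction
      const high = term();
      return { op: "BETWEEN", args: [left, low, high] };
    }
    if (peek() === "BTW") {
      pos++;
      expect("(");
      const low = term();
      expect(",");
      const high = term();
      expect(")");
      return { op: "BTW", args: [left, low, high] };
    }
    if (peek() === "<>") {
      pos++;
      return { op: "<>", args: [left, term()] };
    }
    return left;
  }

  // expression: predicate (AND predicate)*
  function expression(): WhereAst {
    let node = predicate();
    while (peek() === "AND") {
      pos++;
      node = { op: "AND", args: [node, predicate()] };
    }
    return node;
  }

  expect("WHERE");
  const result = expression();
  if (pos !== tokens.length) throw new Error("unexpected token " + tokens[pos]);
  return result;
}

const whereAst = parseWhere("WHERE a <> 17106 AND b BETWEEN c AND d AND e BTW(f, g)");
console.log(JSON.stringify(whereAst));
```

Since the lower bound is a single term, the parser cannot absorb "c AND d AND e ..." into it, so the AND after "c" is the one that belongs to BETWEEN.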
Disclaimer: I don't actually use ANTLR (I use my own), and I haven't worked with the (rather hairy) SQL grammar in a while, so this may be off the mark, but I think to get what you want you'll have to do something along the lines of:
...
where_clause
: WHERE disjunction
;
disjunction
: conjunction OR disjunction
| conjunction
;
conjunction
: element AND conjunction
| element
;
element
: element NOT_EQUAL_INFERIOR element
| element BETWEEN element AND element
| element BTW LEFT_PARENTHESIS element COMMA_CHAR element RIGHT_PARENTHESIS
| WORD
;
...
This is not the complete refactoring needed but illustrates the first steps.

ANTLR4 predicates with greedy * quantifier: avoid unnecessary predicate calls (lexing)

The following lexer grammar snippet is supposed to tokenize 'custom names' depending on a predicate that is defined in a class LexerHelper:
fragment NUMERICAL : [0-9];
fragment XML_NameStartChar
: [:a-zA-Z]
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
;
fragment XML_NameChar : XML_NameStartChar
| '-' | '_' | '.' | NUMERICAL
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment XML_NAME_FRAG : XML_NameStartChar XML_NameChar*;
CUSTOM_NAME : XML_NAME_FRAG ':' XML_NAME_FRAG {LexerHelper.myPredicate(getText())}?;
The correct match for CUSTOM_NAME is always the longest possible match. Now if the lexer encounters a custom name such as some:cname then I would like it to lex the entire string some:cname and then call the predicate once with 'some:cname' as argument.
Instead, the lexer calls the predicate with each possible 'valid' match it finds along the way, so some:c, some:cn, some:cna, some:cnam until finally some:cname.
Is there a way to change the behaviour to force antlr4 to first find the longest possible match, before calling the predicate? Alternatively, is there an efficient way for the predicate to determine that the match is not the longest one yet to simply return with false in that case?
EDIT: The funny thing about this behavior is that as long as only partial matches are passed to the predicate, the result of the predicate seems to be completely ignored by the lexer anyway. This seems oddly inefficient.
As it turns out, the behavior is known and permitted by Antlr. Antlr may or may not call predicates more than necessary (see here for more details). To avoid that behavior I am now using actions instead, which only get executed once the rule has completely and successfully matched. This allows me to e.g. switch modes in an action.
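The idea of "match the longest candidate first, then check it once" can be sketched outside ANTLR as follows. This is only an illustration: the regular expression is an ASCII-only simplification of XML_NAME_FRAG (the Unicode ranges are omitted), and lexCustomName and the validate callback are hypothetical stand-ins for the lexer rule and LexerHelper.myPredicate.

```typescript
// ASCII-only approximation of: XML_NAME_FRAG ':' XML_NAME_FRAG
// For this pattern the greedy match is also the longest possible match.
const CUSTOM_NAME = /^[:a-zA-Z][:a-zA-Z0-9._-]*:[:a-zA-Z][:a-zA-Z0-9._-]*/;

// Match the whole candidate first, then run the check exactly once on it.
function lexCustomName(input: string, validate: (text: string) => boolean): string | null {
  const m = CUSTOM_NAME.exec(input);
  if (m === null) return null;
  return validate(m[0]) ? m[0] : null;
}

const seen: string[] = [];
const lexed = lexCustomName("some:cname = 1", (text) => {
  seen.push(text); // record every call to the check
  return true;
});
console.log(lexed, seen);
```

Here the check sees only the complete text some:cname, which is the behavior the action-based workaround achieves inside the lexer.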

How can I hide parens in ANTLR4?

For example, with input = '(1+2)*3',
the tree looks like this: '(expr (expr ( (expr (expr 1) + (expr 2)) )) * (expr 3))'
I would then like to hide or delete the '(' and ')' in the tree, since they are not needed any more. I tried to do this but did not succeed.
expr : ID LPAREN exprList? RPAREN
| '-' expr
| '!' expr
| expr op=('*'|'/') expr
| expr op=('+'|'-') expr
| ID
| INT
| LPAREN expr RPAREN //### Parens Here ####
;
LPAREN : '(' ;
RPAREN : ')' ;
What I want is NOT the following:
PAREN : ( '(' | ')' ) -> channel(HIDDEN)
Standard parser generator schemes separate parsing from tree building.
This allows one to design custom actions to build an AST, and fine-tune its structure to the targeted language (including leaving out concrete syntax such as "parentheses").
The price is that one must not only specify the grammar, but one must also specify the rules for building the AST. And that makes defining a "grammar + tree builder" about twice as much work as just defining a grammar. When your grammars are tiny, this doesn't matter, but usually tiny grammars means "toy problem". With big real production gnarly grammars, this matters a lot; there's usually a bunch of initial churn in trying to get such grammars right and the AST building stuff just gets in the way during this phase. Clever people delay adding AST building rules till the churn phase is over, but that only partially works, and it turns out that you may want to reshape the grammar based on AST you want to build, so this delay actually increases the churn somewhat. One also pays a maintenance cost; if your grammar has any scale, you will change it, and then the AST building part must change, too.
My company builds a tool, the DMS Software Reengineering Toolkit, which contains a parser generator. We decided, the first to do so AFAIK, some 20 years ago, that this extra AST building step was too much work for the benefit for the many big grammars we expected to (and did) build. So we designed DMS to automatically build a concrete syntax tree as it parsed. Voila, write a grammar, get a parser and the tree is free. This decision has turned out to be a really good one.
The price is the final tree retains all the concrete syntax, e.g., the parentheses. While it may not look elegant, it turns out that this does not matter much in practice when manipulating trees (inspecting, traversing, analyzing, modifying, ...). We've traded a bit of inelegance for much easier tree building and grammar maintenance.
The ANTLR guy(!) decided for ANTLR4, unlike his previous ANTLR1/2/3 systems, to follow our lead and switch to building "ASTs" automatically from the grammar, as concrete syntax trees. (I don't know if you can actually write your own AST building rules to override the built-in feature for ANTLR4. Comments on this answer suggest that the way to get an AST from ANTLR4 is to walk the CST and build what you want. I'm not keen on that solution; it strikes me as paying the price for building and managing the AST, and also having the parsing overhead [time and space] of building the CST. If you only build small trees, maybe you don't care. For DMS, we regularly read thousands of files for processing together; space and time matter!)
For some discussion on how to make this a bit more elegant (effectively even more AST like), see my SO answer on ASTs vs. CSTs
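The "walk the CST and build what you want" approach mentioned in the comments could look roughly like the sketch below. It assumes the concrete tree has been copied into a plain Cst structure; Cst, ExprAst, tok and expr are hypothetical names for this example, not ANTLR types, and only the binary-operator and parenthesis shapes from the question are handled.

```typescript
// Hypothetical concrete tree: either a token leaf or a rule node with children.
interface Cst {
  token?: string;
  children?: Cst[];
}
type ExprAst = string | { op: string; left: ExprAst; right: ExprAst };

function toAst(node: Cst): ExprAst {
  if (node.token !== undefined) return node.token;
  const kids = node.children || [];
  if (kids.length === 1) return toAst(kids[0]); // expr: INT | ID
  if (kids.length === 3) {
    const [first, middle, last] = kids;
    // expr: LPAREN expr RPAREN -> keep only the inner expression
    if (first.token === "(" && last.token === ")") return toAst(middle);
    // expr: expr op expr
    if (middle.token !== undefined) {
      return { op: middle.token, left: toAst(first), right: toAst(last) };
    }
  }
  throw new Error("unsupported node shape");
}

const tok = (t: string): Cst => ({ token: t });
const expr = (...kids: Cst[]): Cst => ({ children: kids });

// CST for (1+2)*3
const cst = expr(
  expr(tok("("), expr(expr(tok("1")), tok("+"), expr(tok("2"))), tok(")")),
  tok("*"),
  expr(tok("3")),
);
const exprAst = toAst(cst);
console.log(JSON.stringify(exprAst));
```

The grouping survives in the shape of the result (the + node sits under the * node), so the parenthesis tokens carry no further information.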
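To suppress useless tokens in the tree in ANTLR 3, put the '!' symbol after the corresponding tokens (this is ANTLR 3 syntax; ANTLR 4 removed the tree-construction operators, so it does not apply there):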
//remove ',' comma from the output
list: LISTNAME LISTMEMBER (','! LISTMEMBER)*;
from http://meri-stuff.blogspot.com/2011/09/antlr-tutorial-expression-language.html

xtext: expression/factor/term grammar

This has got to be one of those well-known examples that's somewhere on the internet, but I can't seem to find it.
I'm trying to learn XText and I figured a calculator expression parser would be a good start. But I'm getting syntax errors in my grammar:
Expression:
Term (('+'|'-') Term)*;
Term:
Factor (('*'|'/') Factor)*;
Factor:
number=Number | variable=ID | ('(' expression=Expression ')');
I get this error in the Expression and Term lines:
Multiple markers at this line
- Cannot change type twice within a rule
- An unassigned rule call is not allowed, when the 'current'
was already created.
What gives? How can I fix this? And when do I have instanceName=Rule vs. Rule entries in a grammar?
I downloaded Xtext integrated with Eclipse, and it comes with a calculator example called arithmetics that does approximately what you want. From what I can gather, you need assigned actions (the {Plus.left=current} parts), which create a node for each operator and make the resulting tree left-associative. This grammar runs fine for me:
Expression:
Term (({Plus.left=current}'+'|{Minus.left=current}'-') right=Term)*;
Term:
Factor (({Multiply.left=current} '*'| {Division.left=current}'/') right=Factor)*;
Factor:
number=NUMBER | variable=ID | ('(' expression=Expression ')');
The example grammar they have for arithmetics can be viewed here. It includes a bit more than yours, like function calls, but the basics are the same.
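As a sketch of what an assigned action like {Plus.left=current} does: each time an operator is seen, a new node is created, the tree built so far becomes its left child, and the new node becomes the current result. The TypeScript below only mimics that behavior for '+' and '-' over numbers; it is not Xtext-generated code, and SumExpr and parseSum are names made up for this example.

```typescript
type SumExpr = number | { type: "Plus" | "Minus"; left: SumExpr; right: SumExpr };

// Expression: Term (({Plus.left=current} '+' | {Minus.left=current} '-') right=Term)*
// Terms are plain numbers here to keep the sketch short.
function parseSum(tokens: string[]): SumExpr {
  let pos = 0;
  let current: SumExpr = Number(tokens[pos++]);
  while (tokens[pos] === "+" || tokens[pos] === "-") {
    const type = tokens[pos++] === "+" ? "Plus" : "Minus";
    const right = Number(tokens[pos++]);
    // New node takes the tree so far as `left`, then becomes `current`.
    current = { type, left: current, right };
  }
  return current;
}

const sum = parseSum(["1", "-", "2", "+", "3"]);
console.log(JSON.stringify(sum));
```

For 1 - 2 + 3 this builds Plus(Minus(1, 2), 3), the left-associative shape.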