How to define non-reserved keywords with ANTLR - SQL

I'm using ANTLR to analyse and rewrite SQL queries.
I have:
select : SELECT ;
fragment S : 's' | 'S' ;
....
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
SELECT : S E L E C T ;
IDENTIFIER : LETTER+ ;
to define reserved keywords and make them case-insensitive.
My question is: how can I define non-reserved keywords?

Your problem seems similar to the one we had when building the parser for the Drools (www.jboss.org/drools) language (DRL). In DRL, for instance, "rule" is a keyword, but it could also be used by a Java programmer as a property name in a POJO, so we can't make it a reserved keyword.
rule /*keyword*/ "my rule"
when
SomeClass( rule /*property name*/ == "foo" )
...
We called these keywords "soft keywords".
To do that in ANTLR, we defined only "true"/"false"/"null" as hard keywords in the LEXER:
https://github.com/droolsjbpm/drools/blob/master/drools-compiler/src/main/resources/org/drools/lang/DRLLexer.g#L132
Everything else is an ID. Then in the PARSER, we used semantic predicates for each soft keyword:
https://github.com/droolsjbpm/drools/blob/master/drools-compiler/src/main/resources/org/drools/lang/DRLExpressions.g#L597
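A minimal sketch of the idea (not the actual DRL code; the rule and token names below are illustrative, written in the ANTLR3-style syntax of the linked grammars):
// Everything lexes as ID; the parser promotes specific IDs to soft keywords
// with a semantic predicate that inspects the upcoming token's text.
keyRULE : {input.LT(1).getText().equals("rule")}? ID ;
keyWHEN : {input.LT(1).getText().equals("when")}? ID ;
Because the lexer never reserves these words, the same ID token can still be matched as a plain identifier (for instance a property name) wherever the parser expects one.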
This makes it possible to integrate seamlessly with Java POJOs without property names and other identifiers clashing with Drools-defined keywords.
Hope it helps.


How to get a parameter to the ANTLR lexer object?

I'm writing Java software to parse SQL queries. To do so, I'm using ANTLR with presto.g4.
The code I'm currently using is pretty standard:
PrestoLexer lexer = new PrestoLexer(
        new CaseChangingCharStream(CharStreams.fromString(query), true));
lexer.removeErrorListeners();
lexer.addErrorListener(errorListener);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PrestoParser parser = new PrestoParser(tokens);
I wonder whether it's possible to pass a parameter to the lexer so that the lexing behaves differently depending on that parameter?
Update:
I've used @Mike's suggestion below: my lexer now inherits from the built-in lexer and adds a predicate function. My issue is now purely a grammar one.
This is my string definition:
STRING
    : '\'' ( '\\' .
           | '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}? // match \ followed by any char
           | ~[\\']                                              // match anything other than \ and '
           | '\'\''                                              // match ''
           )*
      '\''
    ;
I sometimes have a query with weird escaping for which the predicate returns true. For example:
select
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features
from table1
And when I try to parse it, I'm getting:
'\'',''),'
as a single string.
How can I handle this one?
I don't know what you need the parameter for, but you mentioned SQL, so let me present a solution I have used for years: predicates.
In MySQL (which is the dialect I work with) the syntax differs depending on the MySQL version number. So in my grammar I use semantic predicates to switch off and on language parts that belong to a specific version. The approach is simple:
test:
{serverVersion < 80014}? ADMIN_SYMBOL
| ONLY_SYMBOL
;
The ADMIN keyword is only acceptable for version < 8.0.14 (just an example, not true in reality), while the ONLY keyword is a possible alternative in any version.
The variable serverVersion is a member of a base class from which I derive my parser. That can be specified by:
options {
superClass = MySQLBaseRecognizer;
tokenVocab = MySQLLexer;
}
The lexer is also derived from that class, so the version number is available in both lexer and parser (along with other important settings, such as the SQL mode). With this approach you can also implement more complex functions for predicates that need additional processing.
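A minimal Java sketch of what such a base class could look like (the class and field names follow the answer, but the exact wiring is an assumption, not the actual MySQL Workbench code):
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.TokenStream;

// Base class named in the superClass option; predicates such as
// {serverVersion < 80014}? read this field at parse time.
public abstract class MySQLBaseRecognizer extends Parser {
    public int serverVersion;

    protected MySQLBaseRecognizer(TokenStream input) {
        super(input);
    }
}
The embedding code then sets the field before parsing (e.g. parser.serverVersion = 80013;), and a corresponding lexer base class can carry the same information for lexer predicates.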
You can find the full code + grammars at the MySQL Workbench Github repository.
I wonder whether it's possible to pass a parameter to the lexer so the lexing will be different depending on that parameter?
No, the lexer works independently of the parser. You cannot direct the lexer while parsing.

ANTLR4: clean grammar and tree with keywords (aliases?)

I am looking for a solution to a simple problem.
The example:
SELECT date, date(date)
FROM date;
This is a rather stupid example where a table, its column, and a function all have the name "date".
A snippet of my grammar (very simplified):
simple_select
: SELECT selected_element (',' selected_element) FROM from_element ';'
;
selected_element
: function
| REGULAR_WORD
;
function
: REGULAR_WORD '(' function_argument ')'
;
function_argument
: REGULAR_WORD
;
from_element
: REGULAR_WORD
;
DATE: D A T E;
FROM: F R O M;
SELECT: S E L E C T;
REGULAR_WORD
: (SIMPLE_LETTER) (SIMPLE_LETTER | '0'..'9')*
;
fragment SIMPLE_LETTER
: 'a'..'z'
| 'A'..'Z'
;
DATE is a keyword (it is used somewhere else in the grammar).
If I want it to be recognised by my grammar as a normal word, here are my solutions:
1) I add it everywhere I used REGULAR_WORD, next to it.
Example:
selected_element
: function
| REGULAR_WORD
| DATE
;
=> I don't want this solution. DATE is not my only keyword, and I have many rules using REGULAR_WORD, so I would need to add a list of many (50+) keywords like DATE to many (20+) parser rules: it would be absolutely ugly.
PROS: makes a clean tree
CONS: makes a dirty grammar
2) I use a parser rule in between to catch all those keywords, and then I replace every occurrence of REGULAR_WORD with that parser rule.
Example:
word
: REGULAR_WORD
| DATE
;
selected_element
: function
| word
;
=> I do not want this solution either, as it adds one more parser rule to the tree and pollutes the information (I do not want to know that "date" is a word; I want to know that it is a selected_element, a function, a function_argument, or a from_element ...).
PROS: makes a clean grammar
CONS: makes a dirty tree
Either way, I have a dirty tree or a dirty grammar. Isn't there a way to have both clean?
I looked for aliases or a parser equivalent of fragment, but it doesn't seem like ANTLR4 has any.
Thank you, have a nice day!
There are four different grammars for SQL dialects in the ANTLR4 grammar repository, and all four of them use your second strategy. So there seems to be a consensus among ANTLR4 SQL grammar writers. I don't believe there is a better solution, given the design of the ANTLR4 lexer.
As you say, that leads to a bit of noise in the full parse tree, but the relevant non-terminal (function, selected_element, etc.) is certainly present, and it does not seem very difficult to collapse the unit productions out of the parse tree.
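For example, a visitor can simply look through the intermediate word node so it never shows up in your own output. A sketch, assuming the question's grammar is named MyGrammar (so the generated class names below are assumptions):
// Collapses the word unit production while visiting a selected_element.
public class SelectVisitor extends MyGrammarBaseVisitor<String> {
    @Override
    public String visitSelected_element(MyGrammarParser.Selected_elementContext ctx) {
        if (ctx.word() != null) {
            return ctx.word().getText();   // "date" used as a plain selected element
        }
        return visit(ctx.function());      // otherwise descend into the function
    }
}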
As I understand it, when ANTLR4 was being designed, a decision was made to automatically produce only full parse trees, because the design of condensed ("abstract") syntax trees is too idiosyncratic to fit into a grammar DSL. So if you find an AST more convenient, you have the responsibility to generate one yourself. That's generally straightforward, although it involves a lot of boilerplate.
Other parser generators do have mechanisms which can handle "semireserved keywords". In particular, the Lemon parser generator, which is part of the Sqlite project, includes a %fallback declaration which allows you to specify that one or more tokens should be automatically reclassified in a context in which no grammar rule allows them to be used. Unfortunately, Lemon does not generate Java parsers.
Another similar option would be to use a parser generator which supports "scannerless" parsing. Such parsers typically use algorithms like Earley, GLL, or GLR, which can parse arbitrary CFGs, to get around the need for more lookahead than can conveniently be supported in fixed-lookahead algorithms such as LALR(1).
This is the so-called keywords-as-identifiers problem, which has been discussed many times before. For instance, I asked a similar question 6 years ago on the ANTLR mailing list. There are also questions here on Stack Overflow touching this area, for instance Trying to use keywords as identifiers in ANTLR4; not working.
Terence Parr wrote a wiki article for ANTLR3 in 2008 that briefly describes two possible solutions:
This grammar allows "if if call call;" and "call if;".
grammar Pred;
prog: stat+ ;
stat: keyIF expr stat
    | keyCALL ID ';'
    | ';'
    ;
expr: ID
    ;
keyIF : {input.LT(1).getText().equals("if")}? ID ;
keyCALL : {input.LT(1).getText().equals("call")}? ID ;
ID : 'a'..'z'+ ;
WS : (' '|'\n')+ {$channel=HIDDEN;} ;
You can make those semantic predicates more efficient by intern'ing those strings so that you can do integer comparisons instead of string compares.
The other alternative is to do something like this
identifier : KEY1 | KEY2 | ... | ID ;
which is a set comparison and should be faster.
Normally, as @rici already mentioned, people prefer the solution where you keep all keywords in a rule of their own and add that rule to your normal identifier rule (wherever such a keyword is allowed).
The other solution in the wiki can be generalized for any keyword by using a lookup table/list in an action in the ID lexer rule, which checks whether a given string is a keyword. This solution is not only slower, but it also sacrifices clarity in your parser grammar, since you can no longer use keyword tokens in your parser rules.
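One way to read that generalization, sketched here with a parser-side predicate (SoftKeywords is a hypothetical helper holding the keyword list; every keyword remains an ordinary ID token, which is exactly why other parser rules cannot refer to keyword tokens):
// Each soft keyword is still lexed as a plain ID; a predicate consults a lookup set.
keySELECT : {SoftKeywords.isNext(_input, "select")}? ID ;
keyFROM   : {SoftKeywords.isNext(_input, "from")}?   ID ;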

ANTLR4: ordering problem of parser rules for a keyword used in several rules (AND, BETWEEN AND)

I am having a problem while parsing an SQL-like string with ANTLR4.
The string being parsed is:
WHERE a <> 17106
AND b BETWEEN c AND d
AND e BTW(f, g)
Here is a snippet of my grammar:
where_clause
: WHERE element
;
element
: element NOT_EQUAL_INFERIOR element
| element BETWEEN element AND element
| element BTW LEFT_PARENTHESIS element COMMA_CHAR element RIGHT_PARENTHESIS
| element AND element
| WORD
;
NOT_EQUAL_INFERIOR: '<>';
LEFT_PARENTHESIS: '(';
RIGHT_PARENTHESIS: ')';
COMMA_CHAR: ',';
BETWEEN: B E T W E E N;
BTW: B T W;
WORD ... //can be anything ... it doesn't matter for the problem.
(parse tree image omitted; source: hostpic.xyz)
But as you can see in that picture, the tree is not the "correct" one.
ANTLR4 being greedy, it lumps everything that follows the BETWEEN into a single "element", but we want it to take only "c" and "d".
Naturally, since everything ends up inside that element rule, the second AND of the BETWEEN is missing, so parsing fails.
I have tried changing the order of the rules (putting AND before BETWEEN), and I tried making those rules right-associative (<assoc=right>), but neither worked. They change the tree but don't make it the way I want it to be.
I feel like the error is a mix of greediness, associativity, and recursion ... which makes it quite difficult to search for the same kind of issue, but maybe I'm just missing the correct words.
Thanks, have a nice day !
I think you misuse the rule element. I don't think SQL allows you to put just anything as the left and right limits of BETWEEN.
Not tested, but I'd try this:
expression
: expression NOT_EQUAL_INFERIOR expression
| term BETWEEN term AND term
| term BTW LEFT_PARENTHESIS term COMMA_CHAR term RIGHT_PARENTHESIS
| expression AND expression
| term
;
term
: WORD
;
Here your element becomes expression in most places, but in others it becomes term. The latter is a dummy rule for now, but I'm pretty sure you'd want to also add e.g. literals to it.
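For instance, term might later grow into something like this (NUMBER and STRING are hypothetical tokens that are not part of the question's lexer):
term
    : WORD
    | NUMBER   // hypothetical literal tokens
    | STRING
    ;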
Disclaimer: I don't actually use ANTLR (I use my own), and I haven't worked with the (rather hairy) SQL grammar in a while, so this may be off the mark, but I think to get what you want you'll have to do something along the lines of:
...
where_clause
: WHERE disjunction
;
disjunction
: conjunction OR disjunction
| conjunction
;
conjunction
: element AND conjunction
| element
;
element
: element NOT_EQUAL_INFERIOR element
| element BETWEEN element AND element
| element BTW LEFT_PARENTHESIS element COMMA_CHAR element RIGHT_PARENTHESIS
| WORD
;
...
This is not the complete refactoring needed but illustrates the first steps.
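With that layering, the sample WHERE clause should group roughly like this (a hand-drawn sketch of the intended nesting, not generated ANTLR output):
where_clause
└─ disjunction
   └─ conjunction
      ├─ element: a <> 17106
      ├─ AND
      └─ conjunction
         ├─ element: b BETWEEN c AND d
         ├─ AND
         └─ element: e BTW(f, g)
Because element no longer has an AND alternative of its own, the AND that belongs to BETWEEN is consumed inside element, while the chaining AND is handled one level up by conjunction.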

How to solve ambiguity with keywords as identifiers in Grammar-Kit

I've been trying to write the GraphQL language grammar for Grammar-Kit and I've been really stuck on an ambiguity issue for quite some time now. Keywords in GraphQL (such as type, implements, scalar) can also be names of types or fields. For example:
type type implements type {}
At first I defined these keywords as tokens in the BNF, but that would make the case above invalid. But if I write these keywords directly while describing the rule, it results in an ambiguity in the grammar.
An example of an issue I'm seeing, based on the grammar below, is if you define something like this:
directive #foo on Baz | Bar
scalar Foobar #cool
the PSI viewer is telling me that in the position of #cool it's expecting a DirectiveAddtlLocation, which is a rule I don't even reference in the scalar rule. Is anyone familiar with Grammar-Kit who has encountered something like this? I'd really appreciate some insight. Thank you.
Here's an excerpt of the grammar for the error example I mentioned above.
{
tokens=[
LEFT_PAREN='('
RIGHT_PAREN=')'
PIPE='|'
AT='#'
IDENTIFIER="regexp:[_A-Za-z][_0-9A-Za-z]*"
WHITE_SPACE = 'regexp:\s+'
]
}
Document ::= Definition*
Definition ::= DirectiveTypeDef | ScalarTypeDef
NamedTypeDef ::= IDENTIFIER
// I.E. #foo #bar(a: 10) #baz
DirectivesDeclSet ::= DirectiveDecl+
DirectiveDecl ::= AT TypeName
// I.E. directive #example on FIELD_DEFINITION | ARGUMENT_DEFINITION
DirectiveTypeDef ::= 'directive' AT NamedTypeDef DirectiveLocationsConditionDef
DirectiveLocationsConditionDef ::= 'on' DirectiveLocation DirectiveAddtlLocation*
DirectiveLocation ::= IDENTIFIER
DirectiveAddtlLocation ::= PIPE? DirectiveLocation
TypeName ::= IDENTIFIER
// I.E. scalar DateTime #foo
ScalarTypeDef ::= 'scalar' NamedTypeDef DirectivesDeclSet?
Once your grammar sees directive #TOKEN on IDENTIFIER, it consumes a sequence of DirectiveAddtlLocation. Each of those consists of an optional PIPE followed by an IDENTIFIER. As you note in your question, the GraphQL "keywords" are really just special cases of identifiers. So what's probably happening here is that, since you allow any token as an identifier, scalar and Foobar are both being consumed as DirectiveAddtlLocation and it's never actually getting to see a ScalarTypeDef.
# Parses the same as:
directive #foo on Bar | Baz | scalar | Foobar
#cool # <-- ?????
You can get around this by listing out the explicit set of allowed directive locations in your grammar. (You might even be able to get pretty far by just copying the grammar in Appendix B of the GraphQL spec and changing its syntax.)
DirectiveLocation ::= ExecutableDirectiveLocation | TypeSystemDirectiveLocation
ExecutableDirectiveLocation ::= 'QUERY' | 'MUTATION' | ...
TypeSystemDirectiveLocation ::= 'SCHEMA' | 'SCALAR' | ...
Now when you go to parse:
directive #foo on QUERY | MUTATION
# "scalar" is not a directive location, so the DirectiveTypeDef must end
scalar Foobar #cool
(For all that the "identifier" vs. "keyword" distinction is a little weird, I'm pretty sure the GraphQL grammar isn't actually ambiguous; in every context where a free-form identifier is allowed, there's punctuation before a "keyword" could appear again, and in cases like this one there are unambiguous lists of not-quite-keywords that don't overlap.)

Recursive rule in ANTLR

I need an idea of how to express a statement like the following:
Int<Double<Float>>
So, in an abstract form we should have:
1. (easiest case): a<b>
2. a<a<b>>
3. a<a<a<b>>>
4. ... and so on
The thing is that I should make it possible to embed a statement of the form a<b> within the <...> signs so that I have a nested statement. In other words, I should be able to replace the b with a<b>.
The second thing is that the number of opening and closing <> signs should be equal.
How can I do that in ANTLR?
A rule can refer to itself without any problem¹. Let's say we have a rule type which describes your case, in a minimalist approach:
type: typeLiteral ('<' type '>')?;
typeLiteral: 'Int' | 'Double' | 'Float';
Note that the ('<' type '>') part is optional, denoted by the ? symbol, so a bare typeLiteral is a valid type. Here is the syntax tree generated by these rules for your example Int<Double<Float>>:
¹: As long as some terminals (like '<' or '>') can differentiate where the recursion stops.
Image generated by http://ironcreek.net/phpsyntaxtree/
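Since the tree image is not reproduced here, a quick way to see the nesting yourself is to print the parse tree. A sketch, assuming the two rules above live in a combined ANTLR4 grammar named Types (so the generated class names below are assumptions):
import org.antlr.v4.runtime.*;

public class TypesDemo {
    public static void main(String[] args) {
        TypesLexer lexer = new TypesLexer(CharStreams.fromString("Int<Double<Float>>"));
        TypesParser parser = new TypesParser(new CommonTokenStream(lexer));
        // Prints something like:
        // (type (typeLiteral Int) < (type (typeLiteral Double) < (type (typeLiteral Float)) >) >)
        System.out.println(parser.type().toStringTree(parser));
    }
}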