Should lexer rules be unambiguous in Antlr4?
Suppose I would like to parse dates and have defined:
hour: DIGIT09 | (DIGIT1 DIGIT09) | (DIGIT2 DIGIT04);
month: DIGIT19 | (DIGIT1 DIGIT02);
DIGIT12: '1'..'2';
DIGIT1: '1';
DIGIT2: '2';
DIGIT19: '1'..'9';
DIGIT09: '0'..'9';
DIGIT04: '0'..'4';
DIGIT02: '0'..'2';
Here I defined digit ranges in the lexer, but it looks like this doesn't work, since the rules are ambiguous.
Can I define ranges in the parser instead of the lexer?
This type of validation is best performed in a listener or visitor which executes after a parse tree is created. Start with just a number:
NUMBER : [0-9]+;
Then define hour and month based on this:
hour : NUMBER;
month : NUMBER;
After you have a parse tree, implement enterHour and enterMonth to validate that the NUMBER contained in each is valid.
This approach yields the best combination of error recovery and error reporting in the event the user enters incorrect input.
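The validation step described above can be sketched without the ANTLR runtime. This is only an illustration: the class and method names below are hypothetical stand-ins for a generated listener, and the accepted ranges (0..23 for hours, 1..12 for months) are assumptions rather than anything the question specifies.

```python
def check_range(rule_name, text, low, high):
    """Return an error message if text is not an integer in [low, high], else None."""
    value = int(text)  # NUMBER : [0-9]+ guarantees the text is digits only
    if low <= value <= high:
        return None
    return f"{rule_name} out of range: {text!r} (expected {low}..{high})"


class DateValidator:
    """Hypothetical stand-in for a subclass of the generated base listener."""

    def __init__(self):
        self.errors = []

    def enter_hour(self, number_text):
        # In a real listener the text would be read from the NUMBER token
        # of the context object passed to enterHour.
        self._record(check_range("hour", number_text, 0, 23))

    def enter_month(self, number_text):
        self._record(check_range("month", number_text, 1, 12))

    def _record(self, message):
        if message is not None:
            self.errors.append(message)
```

Because the parse has already succeeded by the time these checks run, every out-of-range value can be collected and reported together instead of stopping at the first one.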
I am looking for a solution to a simple problem.
The example :
SELECT date, date(date)
FROM date;
This is a rather stupid example where a table, its column, and a function all have the name "date".
The snippet of my grammar (very simplified) :
simple_select
: SELECT selected_element (',' selected_element)* FROM from_element ';'
;
selected_element
: function
| REGULAR_WORD
;
function
: REGULAR_WORD '(' function_argument ')'
;
function_argument
: REGULAR_WORD
;
from_element
: REGULAR_WORD
;
DATE: D A T E;
FROM: F R O M;
SELECT: S E L E C T;
REGULAR_WORD
: (SIMPLE_LETTER) (SIMPLE_LETTER | '0'..'9')*
;
fragment SIMPLE_LETTER
: 'a'..'z'
| 'A'..'Z'
;
DATE is a keyword (it is used somewhere else in the grammar).
If I want it to be recognised by my grammar as a normal word, here are my solutions :
1) I add it everywhere I used REGULAR_WORD, next to it.
Example :
selected_element
: function
| REGULAR_WORD
| DATE
;
=> I don't want this solution. I don't have only "DATE" as a keyword, and I have many rules using REGULAR_WORD, so I would need to add a list of many (50+) keywords like DATE to many (20+) parser rules : it would be absolutely ugly.
PROS: make a clean tree
CONS: make a dirty grammar
2) I use a parser rule in between to get all those keywords, and then, I replace every occurrence of REGULAR_WORD by that parser rule.
Example :
word
: REGULAR_WORD
| DATE
;
selected_element
: function
| word
;
=> I do not want this solution either, as it adds one more parser rule to the tree and pollutes the information (I do not want to know that "date" is a word; I want to know that it's a selected_element, a function, a function_argument or a from_element).
PROS: make a clean grammar
CONS: make a dirty tree
Either way, I have a dirty tree or a dirty grammar. Isn't there a way to have both clean ?
I looked for aliases or a parser-rule equivalent of fragments, but it doesn't seem like ANTLR4 has any.
Thank you, have a nice day !
There are four different grammars for SQL dialects in the Antlr4 grammar repository and all four of them use your second strategy. So it seems like there is a consensus among Antlr4 SQL grammar writers. I don't believe there is a better solution given the design of the Antlr4 lexer.
As you say, that leads to a bit of noise in the full parse tree, but the relevant non-terminal (function, selected_element, etc.) is certainly present and it does not seem to me to be very difficult to collapse the unit productions out of the parse tree.
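Collapsing the unit productions could look roughly like the following. The Node class is a made-up minimal tree type, not ANTLR's parse-tree API, and the name "word" is taken from the question's second strategy.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal stand-in for a parse-tree node: a rule name or a token text."""
    label: str
    children: list = field(default_factory=list)


def collapse(node, noise=frozenset({"word"})):
    """Return a copy of the tree with single-child `noise` rule nodes removed."""
    kids = [collapse(child, noise) for child in node.children]
    if node.label in noise and len(kids) == 1:
        return kids[0]  # splice the only child into the parent
    return Node(node.label, kids)
```

For example, collapse(Node("selected_element", [Node("word", [Node("date")])])) gives a selected_element node whose direct child is the date token, which is the clean tree the question asks for.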
As I understand it, when Antlr4 was being designed, a decision was made to only automatically produce full parse trees, because the design of condensed ("abstract") syntax trees is too idiosyncratic to fit into a grammar DSL. So if you find an AST more convenient, you have the responsibility to generate one yourself. That's generally straightforward although it involves a lot of boilerplate.
Other parser generators do have mechanisms which can handle "semireserved keywords". In particular, the Lemon parser generator, which is part of the SQLite project, includes a %fallback declaration which allows you to specify that one or more tokens should be automatically reclassified in a context in which no grammar rule allows them to be used. Unfortunately, Lemon does not generate Java parsers.
Another similar option would be to use a parser generator which supports "scannerless" parsing. Such parsers typically use algorithms like Earley/GLL/GLR, capable of parsing arbitrary CFGs, to get around the need for more lookahead than can conveniently be supported in fixed-lookahead algorithms such as LALR(1).
This is the so-called keywords-as-identifiers problem and has been discussed many times before. For instance, I asked a similar question 6 years ago on the ANTLR mailing list. There are also questions touching this area here on Stack Overflow, for instance "Trying to use keywords as identifiers in ANTLR4; not working".
Terence Parr wrote a wiki article for ANTLR3 in 2008 that briefly describes 2 possible solutions:
This grammar allows "if if call call;" and "call if;".
grammar Pred;
prog: stat+ ;
stat: keyIF expr stat
| keyCALL ID ';'
| ';'
;
expr: ID
;
keyIF : {input.LT(1).getText().equals("if")}? ID ;
keyCALL : {input.LT(1).getText().equals("call")}? ID ;
ID : 'a'..'z'+ ;
WS : (' '|'\n')+ {$channel=HIDDEN;} ;
You can make those semantic predicates more efficient by intern'ing those strings so that you can do integer comparisons instead of string compares.
The other alternative is to do something like this
identifier : KEY1 | KEY2 | ... | ID ;
which is a set comparison and should be faster.
Normally, as @rici already mentioned, people prefer the solution where you keep all keywords in a separate rule and add that to your normal identifier rule (where such a keyword is allowed).
The other solution in the wiki can be generalized for any keyword by using a lookup table/list in an action in the ID lexer rule, which is used to check if a given string is a keyword. This solution is not only slower, but also sacrifices clarity in your parser grammar, since you can no longer use keyword tokens in your parser rules.
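To make the predicate idea concrete, here is a small hand-written recognizer for the Pred grammar quoted above. It is a sketch outside ANTLR, not generated code: every word is lexed as a plain ID, and a lookup table plus a predicate function decides whether an ID acts as a keyword at a given position. All function names are invented for the illustration, and the toy lexer silently skips characters it does not know.

```python
import re

KEYWORDS = {"if", "call"}  # lookup table; every word is still lexed as ID


def lex(text):
    """Split input into ID tokens (lowercase words) and ';' tokens."""
    return re.findall(r"[a-z]+|;", text)


def is_key(tokens, pos, word):
    """Predicate: the token at pos is an ID whose text is the keyword `word`."""
    return word in KEYWORDS and pos < len(tokens) and tokens[pos] == word


def is_id(tokens, pos):
    return pos < len(tokens) and tokens[pos] != ";"


def parse_stat(tokens, pos=0):
    """Parse one stat starting at pos and return the position after it."""
    if is_key(tokens, pos, "if") and is_id(tokens, pos + 1):
        return parse_stat(tokens, pos + 2)  # keyIF expr stat
    if (is_key(tokens, pos, "call") and is_id(tokens, pos + 1)
            and tokens[pos + 2:pos + 3] == [";"]):
        return pos + 3  # keyCALL ID ';'
    if tokens[pos:pos + 1] == [";"]:
        return pos + 1  # ';'
    raise SyntaxError(f"unexpected input at token {pos}")


def accepts(text):
    """Return True if text matches prog: stat+ ."""
    tokens = lex(text)
    pos = 0
    try:
        while pos < len(tokens):
            pos = parse_stat(tokens, pos)
    except SyntaxError:
        return False
    return pos > 0
```

Both accepts("if if call call;") and accepts("call if;") hold, matching the two inputs the wiki excerpt mentions, while accepts("call;") does not because keyCALL must be followed by an ID.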
I need an idea how to express a statement like the following:
Int<Double<Float>>
So, in an abstract form we should have:
1.(easiest case): a<b>
2. case: a<a<b>>
3. case: a<a<a<b>>>
4. ....and so on...
The thing is that I should enable the possibility to embed a statement of the form a < b > within the < .. > - signs such that I have a nested statement. In other words: I should replace the b with a< b >.
The 2nd thing is that the number of the opening and closed <>-signs should be equal.
How can I do that in ANTLR ?
A rule can refer to itself without any problem¹. Let's say we have a rule type which describes your case, in a minimalist approach:
type: typeLiteral ('<' type '>')?;
typeLiteral: 'Int' | 'Double' | 'Float';
Note the ('<' type '>') is optional, denoted by the ? symbol, so using only a typeLiteral is a valid type. Here are the syntax trees generated by these rules for your example Int<Double<Float>>:
¹: As long as some terminals (like '<' or '>') can differentiate when the recursion stops.
Image generated by http://ironcreek.net/phpsyntaxtree/
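The same self-referencing rule can be written by hand as a recursive function, which may make it easier to see why the brackets always balance. This is a sketch of the two rules above, not ANTLR output; the nested-tuple result format is just a choice made for the example.

```python
import re

TYPE_LITERALS = {"Int", "Double", "Float"}


def parse_type(text):
    """Parse `typeLiteral ('<' type '>')?` and return nested tuples."""
    tokens = re.findall(r"[A-Za-z]+|[<>]", text)
    tree, pos = _type(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing input after type")
    return tree


def _type(tokens, pos):
    if pos >= len(tokens) or tokens[pos] not in TYPE_LITERALS:
        raise SyntaxError(f"expected a type literal at token {pos}")
    name, pos = tokens[pos], pos + 1
    if pos < len(tokens) and tokens[pos] == "<":
        inner, pos = _type(tokens, pos + 1)  # the rule refers to itself
        if pos >= len(tokens) or tokens[pos] != ">":
            raise SyntaxError("missing closing '>'")
        return (name, inner), pos + 1
    return (name,), pos
```

Each '<' consumed inside _type has to be matched by a '>' before that call returns, so an input such as Int<Double is rejected without any explicit counting.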
Is there any API in ANTLR4 for obtaining the original productions from the grammar?
For example, if there was a rule:
functionHeader : identifier LPAREN parameterDecl RPAREN
... is there some function on the parser that, given the functionHeader rule, would return a list ["identifier", "LPAREN", "parameterDecl", "RPAREN"]?
Well, it is rarely as simple as the list of elements you specify there, but you can look at the augmented transition network (ATN) via parser.getATN(), then get the rule start state, etc. See ATNState.
I'm making a datatype in Antlr called time, which will return a clock of the form Hour:Minute
This is what my code looks like so far:
grammar clock;
clock: HOUR ':' MINUTE;
HOUR: '2'[0-3]|'1'[0-9]|[0-9];
MINUTE: [0-5][0-9];
Our code fails to recognize the "HOUR" portion, and it recognizes minute. I even changed HOUR to be the same value as minute, and it still fails to recognize HOUR. To check if our regex was wrong, we even swapped HOUR and MINUTE in the order, and did MINUTE:HOUR, and it recognized hour, but not minute. Is there something I'm missing? What's going on that it will never parse HOUR, but always MINUTE?
ANTLR lexers fully assign unambiguous token types before the parser is ever used. When multiple token types can match a token, the first one appearing in the grammar is the one that is used. For your grammar, a token cannot have the type HOUR and the type MINUTE at the same time. Since the input 12 matches both of these lexer rules, the first appearing in the grammar is used so 12 will always be an HOUR and never be a MINUTE.
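The tie-breaking behaviour can be imitated with a few lines of Python. This is only an approximation of an ANTLR lexer: it applies "longest match wins, and on equal length the rule listed first wins" using regular expressions that mirror the two rules from the question, and the COLON rule is added here just so the ':' has a token type.

```python
import re

# Rules in grammar order; an earlier rule wins when match lengths are equal.
RULES = [
    ("HOUR", re.compile(r"2[0-3]|1[0-9]|[0-9]")),
    ("MINUTE", re.compile(r"[0-5][0-9]")),
    ("COLON", re.compile(r":")),
]


def tokenize(text):
    """Return (type, text) pairs, deciding each token without any parser context."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer matches replace the current best; ties keep the earlier rule.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens
```

With these rules tokenize("12:12") produces HOUR, COLON, HOUR, so a parser rule expecting HOUR ':' MINUTE cannot match it, no matter what the parser was hoping to see on the right-hand side.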
Typically lexers produce integers similar to the following rule:
INT : [0-9]+;
Then a parser rule for clock might look like this:
clock : INT ':' INT;
Since you are using ANTLR 4, you can extend the generated class ClockBaseListener and override the enterClock method to perform additional validation (specifically, validating that the first INT meets the hour requirements and the second INT meets the minute requirements).
I think the ANTLR lexer is treating my attempt at a range expression "1...3" as a float. The expression "x={1...3}" is coming out of the lexer as "x={.3}" when I used the following token definitions:
FLOAT
: ('0'..'9')+ ('.' '0'..'9'+)? EXPONENT?
| ('.' '0'..'9')+ EXPONENT?
;
AUTO : '...';
When I change FLOAT to just check for integers, as so:
FLOAT : ('0'..'9')+;
then the expression "x={1...3}" is tokenized correctly. Can anyone help me to fix this?
Thanks!
I think the lexer is putting your first period into the FLOAT token, and then the remaining two periods do not make up your AUTO token. You will need a predicate to determine whether the period should be part of a float or an auto token.
Also, why are you using three periods instead of two? Most languages use two periods for a "range", and the lexer should determine whether a period is part of a float or of the range based on the character that follows it.
You probably need to look into The Definitive ANTLR Reference on how to build your predicate for the different rules.
Hope this helps you find the correct way to complete the task.
WayneH hits on your problem. You've allowed floats in the format ".3" (without a leading 0). So, the lexer identifies the last . and the 3 and considers it a floating point number. As a result it doesn't see three dots. It sees two dots and a float.
It's very common for languages to disallow this format for floats and require that there be at least one digit (even if it's a 0) to the left of the decimal. I believe that change to your grammar would fix your problem.
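As a rough illustration of that suggestion, the regex-based tokenizer below requires at least one digit before any decimal point. It is a sketch, not a model of how the ANTLR lexer makes its decisions; the exponent pattern, the ID rule and the OP group are assumptions, since the question does not show EXPONENT or the other tokens, and whitespace is not handled.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<FLOAT>\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)   # digits required before any '.'
  | (?P<AUTO>\.\.\.)                            # the range operator
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<OP>[={}])
""", re.VERBOSE)


def tokenize(text):
    """Return (type, text) pairs for the toy token set above."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Since a FLOAT can no longer start with a period, the three dots in x={1...3} can only be read as AUTO, and 1.5...3 still splits into the float 1.5, the range operator and 3.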
There probably is a way to fix it with a predicate, but I've not yet spent enough time with ANTLR to see an obvious way to do so.
For anyone wanting to do this...
http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point%2C+dot%2C+range%2C+time+specs
I can just change the language syntax to replace the "..." with a "to" keyword.