I'm making a datatype in Antlr called time, which will return a clock of the form Hour:Minute
This is what my code looks like so far:
grammar clock;
clock: HOUR ':' MINUTE;
HOUR: '2'[0-3]|'1'[0-9]|[0-9];
MINUTE: [0-5][0-9];
Our code fails to recognize the "HOUR" portion, and it recognizes minute. I even changed HOUR to be the same value as minute, and it still fails to recognize HOUR. To check if our regex was wrong, we even swapped HOUR and MINUTE in the order, and did MINUTE:HOUR, and it recognized hour, but not minute. Is there something I'm missing? What's going on that it will never parse HOUR, but always MINUTE?
ANTLR lexers assign a single, unambiguous token type to each token before the parser ever runs. The lexer takes the longest match at each position, and when several rules match the same text, the rule that appears first in the grammar wins. For your grammar, a token cannot have the type HOUR and the type MINUTE at the same time. Since the input 12 matches both of these lexer rules, the first one in the grammar is used, so 12 will always be an HOUR and never a MINUTE.
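To make the rule selection concrete, here is a small plain-Python simulation of that behaviour. It is not ANTLR and does not use the ANTLR runtime; it only assumes the two principles described above (longest match first, then grammar order on a tie). The rule name COLON and the function name tokenize are made up for the illustration.

```python
import re

# Lexer rules in grammar order, mirroring the question's HOUR and MINUTE.
# COLON is a hypothetical name for the ':' literal.
RULES = [
    ("HOUR", re.compile(r"2[0-3]|1[0-9]|[0-9]")),
    ("MINUTE", re.compile(r"[0-5][0-9]")),
    ("COLON", re.compile(r":")),
]


def tokenize(text):
    """Pick the longest match at each position; ties go to the earliest rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("12:30"))
print(tokenize("12:15"))
```

In this sketch 30 becomes a MINUTE only because HOUR can match just its first digit, while 15 ties between the two rules and goes to HOUR, so a parser expecting HOUR ':' MINUTE would reject 12:15.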
Typically lexers produce integers similar to the following rule:
INT : [0-9]+;
Then a parser rule for clock might look like this:
clock : INT ':' INT;
Since you are using ANTLR 4, you can extend the generated class ClockBaseListener and override the enterClock method to perform additional validation (specifically, validating that the first INT meets the hour requirements and the second INT meets the minute requirements).
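The range check itself is simple. The following is a minimal sketch of the kind of validation an enterClock override might perform, written as standalone Python rather than against the generated ANTLR classes. The function name validate_clock is hypothetical, and the ranges (hour 0 to 23, minute two digits from 00 to 59) are read off the question's HOUR and MINUTE rules; accepting a leading zero in the hour is an assumption of this sketch.

```python
import re

# Stand-in for the parser rule  clock : INT ':' INT;
CLOCK = re.compile(r"([0-9]+):([0-9]+)")


def validate_clock(text):
    """Return (hour, minute) if text is INT ':' INT with in-range values."""
    m = CLOCK.fullmatch(text)
    if m is None:
        raise ValueError(f"not of the form INT:INT: {text!r}")
    hour_text, minute_text = m.group(1), m.group(2)
    hour, minute = int(hour_text), int(minute_text)
    # Hour range taken from the question's HOUR rule (0..23).
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour_text}")
    # The question's MINUTE rule requires exactly two digits, 00..59.
    if len(minute_text) != 2 or minute > 59:
        raise ValueError(f"minute out of range: {minute_text}")
    return hour, minute


print(validate_clock("12:15"))
```

Because the lexer now produces only INT tokens, 12:15 and 12:30 both parse, and the hour/minute distinction is decided by position in the rule rather than by token type.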
Antlr4 has always been a kind of love-hate relationship for me, but I am currently a bit perplexed. I started creating a grammar to the best of my knowledge, then wanted to test it, and it didn't work at all. I then reduced it to a bare minimum example and still managed to make it not work. This is my grammar:
grammar SwiftMtComponentFormat;
separator : ~ZERO EOF;
ZERO : '0';
In my understanding it should match anything except a '0' and then expect the end of the file. I have been testing it with the single character input '1', which I had expected to work. However this is what happens:
If I change ~ZERO to ZERO and change my input from 1 to 0, it matches perfectly. For some reason the simple negation does not seem to work, and I fail to understand why.
In a parser rule ~ZERO matches any token that is not a ZERO token. The problem in your case is that ZERO is the only type of token that you defined at all, so any other input will lead to a token recognition error and not get to the parser at all. So if you enter the input 1, the lexer will discard the 1 with a token recognition error and the parser will only see an empty token stream.
To fix this, you can simply define a lexer rule OTHER that matches any character not matched by previous lexer rules:
OTHER: .;
Note that this definition has to go after the definition of ZERO - otherwise it would match 0 as well.
Now the input 1 will produce an OTHER token and ~ZERO will match that token. Of course, you could now replace ~ZERO with OTHER and it wouldn't change anything, but once you add additional tokens, ~ZERO will match those as well whereas OTHER would not.
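The effect of the catch-all rule can be sketched in a few lines of plain Python. This is only a toy model of the lexer/parser split, not ANTLR itself: the functions lex and matches_separator are invented for the illustration, and the error text is a simplified imitation of ANTLR's message.

```python
def lex(text, with_other):
    """Return (tokens, errors) for a lexer that knows ZERO and optionally OTHER."""
    tokens, errors = [], []
    for ch in text:
        if ch == "0":
            tokens.append("ZERO")
        elif with_other:
            # Catch-all rule, listed after ZERO so that '0' never reaches it.
            tokens.append("OTHER")
        else:
            # No rule matches: the character is dropped and never reaches the parser.
            errors.append(f"token recognition error at: '{ch}'")
    return tokens, errors


def matches_separator(tokens):
    """Model of  separator : ~ZERO EOF;  exactly one token that is not ZERO."""
    return len(tokens) == 1 and tokens[0] != "ZERO"


for with_other in (False, True):
    tokens, errors = lex("1", with_other)
    print(with_other, tokens, errors, matches_separator(tokens))
```

Without OTHER the parser receives an empty token list for the input 1, so ~ZERO has nothing to match; with OTHER it receives one non-ZERO token and the rule succeeds.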
Should lexer rules be unambiguous in Antlr4?
Suppose I would like to parse dates and defined
hour: DIGIT09 | (DIGIT1 DIGIT09) | (DIGIT2 DIGIT04);
month: DIGIT19 | (DIGIT1 DIGIT02);
DIGIT12: '1'..'2';
DIGIT1: '1';
DIGIT2: '2';
DIGIT19: '1'..'9';
DIGIT09: '0'..'9';
DIGIT04: '0'..'4';
DIGIT02: '0'..'2';
Here I defined digit ranges in the lexer, but it looks like this doesn't work, since the ranges are ambiguous.
Can I define ranges in parser instead of lexer?
This type of validation is best performed in a listener or visitor which executes after a parse tree is created. Start with just a number:
NUMBER : [0-9]+;
Then define hour and month based on this:
hour : NUMBER;
month : NUMBER;
After you have a parse tree, implement enterHour and enterMonth to validate that the NUMBER contained in each is valid.
This approach yields the best combination of error recovery and error reporting in the event the user enters incorrect input.
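As a rough illustration of why late validation helps with error reporting, the sketch below checks every field and collects all range errors instead of failing on the first one. It is standalone Python, not an ANTLR listener; the function name validate and the (kind, text) input shape are assumptions made for the example, and the ranges (hour 0 to 24, month 1 to 12) are taken from the question's hour and month rules.

```python
import re

# Ranges read off the question's rules: hour 0..24, month 1..12.
RANGES = {"hour": (0, 24), "month": (1, 12)}


def validate(fields):
    """Collect every range error instead of stopping at the first one."""
    errors = []
    for kind, text in fields:
        low, high = RANGES[kind]
        if not re.fullmatch(r"[0-9]+", text):
            errors.append(f"{kind}: {text!r} is not a NUMBER")
        elif not low <= int(text) <= high:
            errors.append(f"{kind}: {text} is outside {low}..{high}")
    return errors


print(validate([("hour", "25"), ("month", "0"), ("month", "7")]))
```

In a real listener, each (kind, text) pair would come from the NUMBER token seen in enterHour or enterMonth.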
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard testcases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (eg. recursive descent, packrat, PEG and probably various others... Coco/R is one reasonable package to do this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to this new generator) I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
but this isn't really a very extensible or maintainable solution, and like as not will have introduced other subtle issues I've not come across yet.
Your problem does not require smaller tokens to be preferred (In this case MONTH would never be matched). You need a backtracking behaviour dependent on the text being matched or not. Right?
ANTLR separates tokenization and parsing strictly. Consequently every solution to your problem will seem like a hack.
However, other parser generators are specialized for problems like yours. Packrat parsers (PEG) backtrack and allow tokenization on the fly. Try out parboiled for this purpose.
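To show what backtracking buys you here, the following is a minimal sketch in plain Python that tries every known symbol at each position and backtracks when the remainder cannot be split. It is not parboiled or any real PEG library, it uses only the four tokens from the question, and it ignores the distinction between prefixes and units; the names UNITS and segmentations are made up for the example.

```python
# The four tokens from the question, keyed by their text.
UNITS = {"m": "MILLI_OR_METRE", "osm": "OSMOLE", "mo": "MONTH", "s": "SECOND"}


def segmentations(text):
    """Yield every way to split text into known unit symbols."""
    if not text:
        yield []
        return
    for symbol, name in UNITS.items():
        if text.startswith(symbol):
            # Commit to this symbol, then try to split the rest;
            # if the rest cannot be split, nothing is yielded (backtrack).
            for rest in segmentations(text[len(symbol):]):
                yield [name] + rest


for split in segmentations("mosm"):
    print(split)
```

For mosm this finds both MILLI_OR_METRE OSMOLE and MONTH SECOND MILLI_OR_METRE, so the tokenization is no longer forced by the longest match, but something still has to choose between the candidates.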
Appears that the question is not being framed correctly.
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
But, according to the UCUM:
The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units.
The most to expect from the lexer is an unambiguous identification of the measurement string without regard to its semantic value. Similarly, a parser alone will be unable to validly select between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, structural (explicitly identifying the problem space) or contextual (considering the relative nature of other units in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. Antlr could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.
In ANTLR4, it seems that predicates can only be placed at the front of sub-rules in order for them to cause the sub-rule to be skipped. In my grammar, some predicates depend on a token that appears near the end of the sub-rule, with one or more rule invocations in front of it. For example:
date :
{isYear(_input.LT(3).getText())}?
month day=INTEGER year=INTEGER { ... }
In this particular example, I know that month is always one single token, so it is always Token 3 that needs to be checked by isYear(). In general, though, I won't know the number of tokens making up a rule like month until runtime. Is there a way to get its token count?
There is no built-in way to get the length of the rule programmatically. You could use the documentation for ATNState in combination with the _ATN field in your parser to calculate all paths through a rule - if all paths through the rule contain the same number of tokens, then you have calculated the exact number of tokens used by the rule.
I think the ANTLR lexer is treating my attempt at a range expression "1...3" as a float. The expression "x={1...3}" is coming out of the lexer as "x={.3}" when I used the following token definitions:
FLOAT
: ('0'..'9')+ ('.' '0'..'9'+)? EXPONENT?
| ('.' '0'..'9')+ EXPONENT?
;
AUTO : '...';
When I change FLOAT to just check for integers, as so:
FLOAT : ('0'..'9')+;
then the expression "x={1...3}" is tokenized correctly. Can anyone help me to fix this?
Thanks!
I think the lexer is putting your first period into the FLOAT token, and then the remaining two periods do not make up your AUTO token. You will need a predicate to determine whether the period should be part of a float or an AUTO token.
So why are you using three periods instead of two? Most languages use two periods for a range, and the grammar should determine whether a period is part of a float or of the range based on the character that follows it.
You probably need to look into The Definitive ANTLR Reference for how to build your predicate for the different rules.
Hope this helps you find the correct way to complete the task.
WayneH hits on your problem. You've allowed floats in the format ".3" (without a leading 0). So, the lexer identifies the last . and the 3 and considers it a floating point number. As a result it doesn't see three dots. It sees two dots and a float.
It's very common for languages to disallow this format for floats and require that there be at least one digit (even if it's a 0) to the left of the decimal. I believe that change to your grammar would fix your problem.
There probably is a way to fix it with a predicate, but I've not yet spent enough time with ANTLR to see an obvious way to do so.
For anyone wanting to do this...
http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point%2C+dot%2C+range%2C+time+specs
I can just change the language syntax to replace the "..." with a "to" keyword.