Repeat a token specific number of times in ANTLR4 - antlr

I want to design a rule that expresses date format types
like :
'ddmmyyyy'
'dd/mm/yyyy'
'mm/dd/yyyy'
'dd-mm-yyyy'
'yyyymmdd'
I tried this in Lexer, but it didn't work (token recognition error)
fragment DATE_SEPARATOR: ('/'|',');
FORMATE
: '\'' (MONTH{1,2} (DATE_SEPARATOR)? DAY{1,2} (DATE_SEPARATOR)? YEAR{1,4}
| DAY{1,2} (DATE_SEPARATOR)? MONTH{1,3} (DATE_SEPARATOR)? YEAR{1,4}
| YEAR{1,4} (DATE_SEPARATOR)? MONTH{1,3} (DATE_SEPARATOR)? DAY{1,2})
;
fragment DAY : ('d'|'D'|'day'|'DAY') ;
fragment MONTH : ('m'|'M'|'Mon'|'Month'|'mon'|'month') ;
fragment YEAR : ('y'|'Y'|'Year'|'year'|'Yr'|'yr') ;
I only called FORMATE in parser with SINGLE_QUOTE after it which refer to ' in Lexer

{1,4} is not supported by ANTLR. If you want to match something between 1 and 4 times, you need to do:
NUMBER : D (D (D D?)?)?;
fragment D : [0-9];
But when you do YEAR{1,4} and the YEAR fragment matches ('y'|'Y'|'Year'|'year'|'Yr'|'yr'), then, if {1,4} was valid, you'd successfully match "YearYrYyr" (4 times the YEAR fragment). However, that seems a bit odd.

Related

ANTLR4 grammar: getting mismatched input error

I have defined the following grammar:
grammar Test;
parse: expr EOF;
expr : IF comparator FROM field THEN #comparatorExpr
;
dateTime : DATE_TIME;
number : (INT|DECIMAL);
field : FIELD_IDENTIFIER;
op : (GT | GE | LT | LE | EQ);
comparator : op (number|dateTime);
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
IF : '$IF';
FROM : '$FROM';
THEN : '$THEN';
OR : '$OR';
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '=' ;
INT : DIGIT+;
DECIMAL : INT'.'INT;
DATE_TIME : (INT|DECIMAL)('M'|'y'|'d');
FIELD_IDENTIFIER : (LETTER|DIGIT)(LETTER|DIGIT|' ')*;
WS : [ \r\t\u000C\n]+ -> skip;
And I try to parse the following input:
$IF >=15 $FROM AgeInYears $THEN
it gives me the following error:
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
All SO posts I found point out to the same reason for this error - identical LEXER rules. But I cannot see why 15 can be matched to either DECIMAL - it requires . between 2 ints, or to DATE_TIME - it has m|d|y suffix as well.
Any pointers would be appreciated here.
It's always a good idea to run take a look at the token stream that your Lexer produces:
grun Test parse -tokens -tree Test.txt
[#0,0:2='$IF',<'$IF'>,1:0]
[#1,4:5='>=',<'>='>,1:4]
[#2,6:8='15 ',<FIELD_IDENTIFIER>,1:6]
[#3,9:13='$FROM',<'$FROM'>,1:9]
[#4,15:25='AgeInYears ',<FIELD_IDENTIFIER>,1:15]
[#5,26:30='$THEN',<'$THEN'>,1:26]
[#6,31:30='<EOF>',<EOF>,1:31]
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
(parse (expr $IF (comparator (op >=) 15 ) $FROM (field AgeInYears ) $THEN) <EOF>)
Here we see that "15 " (1 5 space) has been matched by the FIELD_IDENTIFIER rule. Since that's three input characters long, ANTLR will prefer that Lexer rule to the INT rule that only matches 2 characters.
For this particular input, you can solve this be reworking the FIELD_IDENTIFIER rule to be:
FIELD_IDENTIFIER: (LETTER | DIGIT)+ (' '+ (LETTER | DIGIT))*;
grun Test parse -tokens -tree Test.txt
[#0,0:2='$IF',<'$IF'>,1:0]
[#1,4:5='>=',<'>='>,1:4]
[#2,6:7='15',<INT>,1:6]
[#3,9:13='$FROM',<'$FROM'>,1:9]
[#4,15:24='AgeInYears',<FIELD_IDENTIFIER>,1:15]
[#5,26:30='$THEN',<'$THEN'>,1:26]
[#6,31:30='<EOF>',<EOF>,1:31]
(parse (expr $IF (comparator (op >=) (number 15)) $FROM (field AgeInYears) $THEN) <EOF>)
That said, I suspect that attempting to allow spaces within your FIELD_IDENTIFIER (without some sort of start/stop markers), is likely to be a continuing source of pain as you work on this. (There's a reason why you don't see this is most languages, and it's not that nobody thought it would be handy to allow for multi-word identifiers. It requires a greedy lexer rule that is likely to take precedence over other rules (as it did here)).

Not able to parse continuos string using antlr (without spaces)

I have to parse the following query using antlr
sys_nameLIKEvalue
Here sys_name is a variable which has lower case and underscores.
LIKE is a fixed key word.
value is a variable which can contain lower case uppercase as well as number.
Below the grammer rule i am using
**expression : parameter 'LIKE' values EOF;
parameter : (ID);
ID : (LOWERCASE) (LOWERCASE | UNDERSCORE)* ;
values : (VALUE);
VALUE : (LOWERCASE | NUMBER | UPPERCASE)+ ;
LOWERCASE : 'a'..'z' ;
UPPERCASE : 'A'..'Z' ;
NUMBER : '0'..'9' ;
UNDERSCORE : '_' ;**
Test Case 1
Input : sys_nameLIKEabc
error thrown : line 1:8 missing 'LIKE' at 'LIKEabc'
Test Case 2
Input : sysnameLIKEabc
error thrown : line 1:0 mismatched input 'sysnameLIKEabc' expecting ID
A literal token inside your parser rule will be translated into a plain lexer rule. So, your grammar really looks like this:
expression : parameter LIKE values EOF;
parameter : ID;
values : VALUE;
LIKE : 'LIKE';
ID : LOWERCASE (LOWERCASE | UNDERSCORE)* ;
VALUE : (LOWERCASE | NUMBER | UPPERCASE)+ ;
// Fragment rules will never become tokens of their own: good practice!
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;
fragment NUMBER : '0'..'9' ;
fragment UNDERSCORE : '_' ;
Since lexer rules are greedy, and if two or more lexer rules match the same amount of character the first will "win", your input is tokenized as follows:
Input: sys_nameLIKEabc, 2 tokens:
sys_name: ID
LIKEabc: VALUE
Input: sysnameLIKEabc, 1 token:
sys_nameLIKEabc: VALUE
So, the token LIKE will never be created with your test input, so none of your parser rule will ever match. It also seems a bit odd to parse input without any delimiters, like spaces.
To fix your issue, you will either have to introduce delimiters, or disallow your VALUE to contain uppercases.

Grammar for ANLTR 4

I'm trying to develop a grammar to parse a DSL using ANTLR4 (first attempt at using it)
The grammar itself is somewhat similar to SQL in the sense that should
It should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
By browsing on internet it seems to me that the main reason is down to the lexer selecting the longest matching sequence, but even after several attempts of rewriting lexer and grammar rules I could not achieve a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? should I use lexer modes ?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands
What is a bit puzzling me is that the single parts are instead working properly,
e.g. parsing only the 'expr' as shown by the tree below
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. Same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : CHAR+;, and WORD: CHAR (CHAR | NUMBER)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where is it most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.

Simple Antlr3 Token parsing

while i'm somewhat comforted by the amount of questions regarding Antlr grammar (it's not just me trying to shave this yak shaped thing), i haven't found a question/answer that comes close to helping with my issue.
I'm using Antlr3.3 with a mixed Token/Parser lexer.
I'm using gUnit to help prove the grammar, and some jUnit tests; this is where the fun begins.
I have a simple config file i want to parse:
identifier foobar {
port=8080
stub plusone.google.com {
status-code = 206
header = []
body = []
}
}
I'm having trouble parsing the "identifier" (foobar in this example):
Valid names i want to allow are:
foobar
foo-bar
foo_bar
foobar2
foo-bar2
foo_bar2
3foobar
_foo-bar3
and so on, therefore a valid name can use the characters 'a..z'|'A..Z', '0..9' '_' and '-'
The grammar i've arrived at is this (note this isnt the full grammar, just the portion pertinent to this question):
fragment HYPHEN : '-' ;
fragment UNDERSCORE : '_' ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' |'A'..'Z' ;
fragment NUMBER : DIGIT+ ;
fragment WORD : LETTER+ ;
IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;
and the corresponding gUnit test
IDENTIFIER:
"foobar" OK
"foo_bar" OK
"foo-bar" OK
"foobar1" OK
"foobar12" OK
"foo-bar2" OK
"foo_bar2" OK
"foo-bar-2" OK
"foo-bar_2" OK
"5foobar" OK
"f_2-a" OK
"aA0_" OK
// no "funny chars"
"foo#bar" FAIL
// not with whitepsace
"foo bar" FAIL
Running the gUnit tests only fails for "5foobar". I've managed to parse the difficult stuff, and yet the seemingly simple task of parsing an identifier has beaten me.
Can anyone point me to where i'm going wrong? How can i match without being greedy?
Many thanks in advance.
-- UPDATE --
I changed the grammar as per Barts answer, to this:
IDENTIFIER : ('0'..'9'| 'a'..'z'|'A'..'Z' | '_'|'-') ('_'|'-'|'a'..'z'|'A'..'Z'|'0'..'9')* ;
and this fixed the failing gUnit tests, but broke an unreleated jUnit test, that tests the "port" parameter.
The following grammar deals with the "port=8080" element of the config snippet above:
configurationStatement[MiddlemanConfiguration config]
: PORT EQ port=NUMBER {
config.setConfigurationPort(Integer.parseInt(port.getText())); }
| def=proxyDefinition { config.add(def); }
;
The message i get is:
mismatched input '8080' expecting NUMBER
Where NUMBER is defined as NUMBER : ('0'..'9')+ ;
Moving the rule for NUMBER above the IDENTIFIER block, fixed this issue.
IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;
is equivalent to:
IDENTIFIER
: DIGIT
| LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
;
So, an IDENTIFIER is eiter a single DIGIT, or starts with a LETTER followed by (LETTER | DIGIT | HYPHEN | UNDERSCORE)*.
You probably meant:
IDENTIFIER
: (DIGIT | LETTER | UNDERSCORE) (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
;
However, that also allows for 3---3 as being a valid IDENTIFIER, is that correct?

ANTLR : A lexer or a parser error?

I wrote a simple lexer in ANTLR and the grammer for ID is something like this :
ID : (('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*|'_'('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*);
(No digits are allowed at the beginning)
when I generated the code (in java) and tested the input :
3a
I expected an error but the input was recognized as "INT ID" , how can i fix the grammer to make it report an error ?(with only lexer rules)
Thanks for your attention
Note that your rule could be rewritten into:
ID
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' |'_')*
;
or with fragments (rules that won't produce tokens, but are only used by other lexer rules):
ID
: (Letter | '_') (Letter| Digit |'_')*
;
fragment Letter
: 'a'..'z'
| 'A'..'Z'
;
fragment Digit
: '0'..'9'
;
But if input like "3a" is recognized by your lexer and produces the tokens INT and ID, then you shouldn't change anything. A problem with such input would probably come up in your parser rule(s) because it is semantically incorrect.
If you really want to let the lexer handle this kind of stuff, you could do something like this:
INT
: Digit+ (Letter {/* throw an exception */})?
;
And if you want to allow INT literals to possibly end with a f or L, then you'd first have to inspect the contents of Letter and if it's not "f" or "L", the you throw an exception.