I'm writing this grammar so that I can recognize streets, postal codes, etc. It only gives the error below, but I can't solve it.
grammar LabeledExpr;
/** The start rule; begin parsing here. */
exp: Inicio parte1 parte2 parte4 NL exp
| Inicio parte6 parte2 parte7 NL exp
|fim;
fim: /*vazio*/;
parte1: Id_Env Str Rua;
parte2: Virg Num parte3|/*vazio*/;
parte3: Andar|/*vazio*/;
parte4:Cod_Postal Str parte5;
parte5: Str |/*vazio*/;
parte6: Cod_Postal Id_Env Rua;
parte7:Str Str parte5;
Space : (' '|'\t')+ { skip(); };
Inicio : '#ID#';
Id_Env: [1-9]Nu?Nu?Nu?|'0';
Rua : '\"'('Rua'|'Av.'|'Trav.')Letra'\"';
Str : '\"'Letra'\"';
Letra: [A-Za-z ]+;
XXXX : [1-9]YYY;
YYY : Nu Nu Nu;
Andar: Num | 'R/C' | 'cave';
Cod_Postal: XXXX('-'YYY)?;
Num: [1-9]Nu*;
Nu: [0-9];
Virg:',';
NL : [\r\n]+;
Ponto: . ;
The error is:
line 1:38 mismatched input '123' expecting Num
line 2:35 mismatched input '3' expecting Num
line 3:55 mismatched input '9876' expecting Num
line 4:39 mismatched input '2623' expecting Num
Does anyone understand it?
Id_Env matches 123 as it is before Num.
Ter
You should make some of your lexer rules parser rules instead. Like Ter already pointed out, you have some lexer rules that can match the same input. This is resolved as "first wins", i.e. the topmost wins.
I'd also make Letra a fragment, since otherwise it can match input meant for other rules, such as 'cave'.
Also note that Ponto matches any single character. Although I'm not proficient in your mother language, it sounds to me like Ponto should only match the point, so you have to write '.' (quoted) instead of a bare . (which is the wildcard).
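To make the "first wins" rule concrete, here is a minimal Python sketch. It does not use ANTLR; it only emulates the tie-break with regular expressions standing in for Id_Env and Num. The helper name winning_rule is made up for this illustration.

```python
import re

def winning_rule(rules, text):
    """Return (name, lexeme) for the rule matching the longest prefix of text.

    Ties are broken in favour of the rule listed first, mirroring the
    "topmost wins" behaviour described above.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Only a strictly longer match replaces the current best.
        if m and (best is None or m.end() > len(best[1])):
            best = (name, m.group())
    return best

# Regex stand-ins for two of the lexer rules from the question.
ID_ENV = ("Id_Env", r"[1-9][0-9]?[0-9]?[0-9]?|0")
NUM = ("Num", r"[1-9][0-9]*")

as_written = winning_rule([ID_ENV, NUM], "123")
reordered = winning_rule([NUM, ID_ENV], "123")
print(as_written)  # ('Id_Env', '123')
print(reordered)   # ('Num', '123')
```

Both rules match all of 123, so only the order decides. Reordering alone just moves the problem to the places where Id_Env is expected, which is why turning some of these lexer rules into parser rules is the more robust fix.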
I have defined the following grammar:
grammar Test;
parse: expr EOF;
expr : IF comparator FROM field THEN #comparatorExpr
;
dateTime : DATE_TIME;
number : (INT|DECIMAL);
field : FIELD_IDENTIFIER;
op : (GT | GE | LT | LE | EQ);
comparator : op (number|dateTime);
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
IF : '$IF';
FROM : '$FROM';
THEN : '$THEN';
OR : '$OR';
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '=' ;
INT : DIGIT+;
DECIMAL : INT'.'INT;
DATE_TIME : (INT|DECIMAL)('M'|'y'|'d');
FIELD_IDENTIFIER : (LETTER|DIGIT)(LETTER|DIGIT|' ')*;
WS : [ \r\t\u000C\n]+ -> skip;
And I try to parse the following input:
$IF >=15 $FROM AgeInYears $THEN
it gives me the following error:
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
All the SO posts I found point to the same cause for this error: identical lexer rules. But I cannot see why 15 would be matched as DECIMAL (which requires a . between two INTs) or as DATE_TIME (which requires an M, y or d suffix).
Any pointers would be appreciated here.
It's always a good idea to take a look at the token stream that your Lexer produces:
grun Test parse -tokens -tree Test.txt
[#0,0:2='$IF',<'$IF'>,1:0]
[#1,4:5='>=',<'>='>,1:4]
[#2,6:8='15 ',<FIELD_IDENTIFIER>,1:6]
[#3,9:13='$FROM',<'$FROM'>,1:9]
[#4,15:25='AgeInYears ',<FIELD_IDENTIFIER>,1:15]
[#5,26:30='$THEN',<'$THEN'>,1:26]
[#6,31:30='<EOF>',<EOF>,1:31]
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
(parse (expr $IF (comparator (op >=) 15 ) $FROM (field AgeInYears ) $THEN) <EOF>)
Here we see that "15 " (1 5 space) has been matched by the FIELD_IDENTIFIER rule. Since that's three input characters long, ANTLR will prefer that Lexer rule to the INT rule that only matches 2 characters.
For this particular input, you can solve this by reworking the FIELD_IDENTIFIER rule to be:
FIELD_IDENTIFIER: (LETTER | DIGIT)+ (' '+ (LETTER | DIGIT)+)*;
grun Test parse -tokens -tree Test.txt
[#0,0:2='$IF',<'$IF'>,1:0]
[#1,4:5='>=',<'>='>,1:4]
[#2,6:7='15',<INT>,1:6]
[#3,9:13='$FROM',<'$FROM'>,1:9]
[#4,15:24='AgeInYears',<FIELD_IDENTIFIER>,1:15]
[#5,26:30='$THEN',<'$THEN'>,1:26]
[#6,31:30='<EOF>',<EOF>,1:31]
(parse (expr $IF (comparator (op >=) (number 15)) $FROM (field AgeInYears) $THEN) <EOF>)
That said, I suspect that attempting to allow spaces within your FIELD_IDENTIFIER (without some sort of start/stop markers) is likely to be a continuing source of pain as you work on this. There's a reason why you don't see this in most languages, and it's not that nobody thought it would be handy to allow multi-word identifiers: it requires a greedy lexer rule that is likely to take precedence over other rules (as it did here).
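The longest-match effect can be reproduced without ANTLR. The following Python sketch uses regular expressions as stand-ins for INT and FIELD_IDENTIFIER; FIELD_OLD mirrors the rule from the question and FIELD_NEW is a rule that only accepts a space when another word follows. The function name winner is hypothetical.

```python
import re

# Regex stand-ins for the lexer rules, in grammar order (INT comes first).
INT = r"[0-9]+"
FIELD_OLD = r"[a-zA-Z0-9][a-zA-Z0-9 ]*"           # may end in a space
FIELD_NEW = r"[a-zA-Z0-9]+(?: +[a-zA-Z0-9]+)*"    # space only between words

def winner(field_pattern, text):
    """Pick the rule with the longest match; on a tie the earlier rule wins."""
    rules = [("INT", INT), ("FIELD_IDENTIFIER", field_pattern)]
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and (best is None or m.end() > len(best[1])):
            best = (name, m.group())
    return best

rest = "15 $FROM AgeInYears $THEN"   # input as seen after the '>=' token
print(winner(FIELD_OLD, rest))  # ('FIELD_IDENTIFIER', '15 ')
print(winner(FIELD_NEW, rest))  # ('INT', '15')
```

With the old rule the identifier swallows the trailing space and is one character longer than INT, so it wins. With the new rule both match exactly 15, and INT wins because it is defined first. Note that an input such as 15 AgeInYears (no keyword in between) would still lex as one identifier, which is the continuing pain mentioned above.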
In ANTLR, suppose I have a rule such as:
> someRule : COMM TYPE arg EQUAL COMMENT_TEXT;
where:
COMM : '|'
TYPE : ('C'|'I'|'U')
arg : either 'number,number' (e.g. 1,0) or a single number (e.g. 1)
EQUAL : '='
COMMENT_TEXT : a string
it would accept :
- | C10,1 = comment
- | U10 = comment
In my grammar this rule is a DEFINITION.
When even one of these tokens is missing, I would like the line to be treated as a generic comment:
|C1 comment -> Generic Comment (EQUAL is missing)
|1 = comment -> Generic comment ('type' is missing)
|C = Comment -> Generic comment ('arg' is missing)
|comment ....
Grammar:
currLine : commentType | .....;
commentType: COMM (defComm | genComm);
defComm: TYPE arg EQUAL COMMENT_TEXT #defcom;
How can I say that everything else is genComm?
genComm: ....
EDIT:
One possible solution can be:
genComment
: TYPE arg? EQUAL? COMMENT_TEXT?
| arg EQUAL? COMMENT_TEXT?
| EQUAL COMMENT_TEXT?
| COMMENT_TEXT
;
My Parser grammar:
parser grammar ParserComments;
options {
tokenVocab = LexerComments;
}
prog : (line? EOL)+;
line : comment;
comment: SINGLE_COMMENT (defComm | genericComment);
defComm: TYPE arg EQUAL COMMENT_TEXT;
arg : (argument1) (VIRGOLA argument2)?;
argument1 : numbers ;
argument2 : numbers ;
numbers : NUMBER+ ;
genericComment
: TYPE arg? EQUAL? COMMENT_TEXT?
| arg EQUAL? COMMENT_TEXT?
| EQUAL COMMENT_TEXT?
| COMMENT_TEXT
;
// ------ general ------
ignored : . ;
My Lexer grammar:
lexer grammar LexerComments;
SINGLE_COMMENT : '|' -> pushMode(COMMENT);
NUMBER : [0-9];
VIRGOLA : ',';
WS : [ \t] -> skip ;
EOL : [\r\n]+;
// ------------ Everything INSIDE a COMMENT ------------
mode COMMENT;
COMMENT_NUMBER : NUMBER -> type(NUMBER);
COMMENT_VIRGOLA : VIRGOLA -> type(VIRGOLA);
TYPE : 'I'| 'U'| 'Q';
EQUAL : '=';
COMMENT_TEXT: ('a'..'z' | 'A'..'Z')+;
WS_1 : [ \t] -> skip ;
COMMENT_EOL : EOL -> type(EOL);
but parsing:
| Q1,0 = text
I get a full context and ambiguity error in BaseErrorListener.
A couple of changes seem to give you the results you want:
1 - you need to popMode out of COMMENT mode when you encounter an EOL:
COMMENT_EOL: EOL -> type(EOL),popMode;
2 - you can use the following for your genericComment rule:
genericComment: .*?;
This basically says "match any tokens, but don't be greedy", so it won't consume the EOL token. As a result, it takes every token up to the next EOL token.
BTW... your ignored rule is purely extraneous. Parser rules are evaluated through a recursive descent calling structure. If a rule is not a start rule, it can only be accessed by being referenced by another rule.
(It's not uncommon to have a final lexer rule that matches . (i.e. anything) to catch anything that prior Lexer rules did not match. But that works because Lexer rules are not evaluated by a recursive descent algorithm)
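To see why the popMode matters, here is a rough Python emulation of the two lexer modes (no ANTLR involved). The regexes are stand-ins for the rules in the question, and the pop_on_eol switch is a made-up parameter that toggles the suggested fix.

```python
import re

# Regex stand-ins for the two lexer modes from the question, in grammar order.
DEFAULT_MODE = [("SINGLE_COMMENT", r"\|"), ("NUMBER", r"[0-9]"),
                ("VIRGOLA", r","), ("WS", r"[ \t]"), ("EOL", r"[\r\n]+")]
COMMENT_MODE = [("NUMBER", r"[0-9]"), ("VIRGOLA", r","), ("TYPE", r"[IUQ]"),
                ("EQUAL", r"="), ("COMMENT_TEXT", r"[a-zA-Z]+"),
                ("WS", r"[ \t]"), ("EOL", r"[\r\n]+")]

def tokenize(text, pop_on_eol):
    modes = [DEFAULT_MODE]          # mode stack; the last entry is active
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in modes[-1]:
            m = re.compile(pattern).match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            # No rule in the active mode matches this character.
            tokens.append(("ERROR", text[pos]))
            pos += 1
            continue
        if best_name != "WS":
            tokens.append((best_name, text[pos:best_end]))
        if best_name == "SINGLE_COMMENT":
            modes.append(COMMENT_MODE)      # pushMode(COMMENT)
        elif best_name == "EOL" and pop_on_eol and len(modes) > 1:
            modes.pop()                     # popMode
        pos = best_end
    return tokens

text = "| Q1,0 = text\n| U10 = more\n"
fixed = [name for name, _ in tokenize(text, pop_on_eol=True)]
stuck = [name for name, _ in tokenize(text, pop_on_eol=False)]
print(fixed)
print(stuck)
```

Without the pop, the lexer is still in COMMENT mode when the second line starts, and no rule in that mode matches the | character, so the sketch emits an ERROR entry there. With the pop, the second | is a SINGLE_COMMENT again.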
Re: Lexer rules
The first step in ANTLR parsing your input is to convert the input stream of characters into a stream of tokens. This process uses your Lexer rules (the rules that begin with a capital letter). At this stage the parser rules are irrelevant; they act only on the stream of tokens that the Lexer produces.
When the Lexer (aka tokenizer) tokenizes your input characters, it evaluates the input against all of your Lexer rules. When more than one rule can match your input, there are two "tie-breaker" strategies:
1 - The Lexer rule that matches the longest sequence of input characters takes top priority.
2 - If more than one rule matches the same (longest) sequence of characters, the rule that appears first in the grammar "wins".
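The two tie-breakers can be sketched in a few lines of Python. This is only an emulation with regular expressions, not ANTLR's actual implementation, and the IF/ID/WS rules are hypothetical examples that do not come from the grammars above.

```python
import re

# Hypothetical lexer rules, listed in grammar order.
RULES = [("IF", r"if"), ("ID", r"[a-z]+"), ("WS", r" +")]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Tie-breaker 1: a strictly longer match replaces the current best.
            # Tie-breaker 2: an equal-length match does not, so the rule
            # that appears first keeps priority.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no lexer rule matches at offset {pos}")
        if best_name != "WS":               # emulate '-> skip'
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

print(tokenize("if iffy"))  # [('IF', 'if'), ('ID', 'iffy')]
```

For the input if, both IF and ID match two characters, so IF wins by position. For iffy, ID matches four characters against two for IF, so ID wins by length even though it is listed later.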
I am new to ANTLR. I defined the following test grammar; it's basically intended to parse a series of assignment statements like the following:
x=1
y=10
=======================================================================
grammar test;
program
:
assignstatement*
;
assignstatement
:
ID '=' INT
;
ID : ('_'|'a'..'z'|'A'..'Z'|DIGIT) ('_'|'a'..'z'|'A'..'Z'|DIGIT)*;
INT: DIGIT+;
fragment DIGIT : [0-9] ; // not a token by itself
I got the following error when running the testRig
[#0,0:0='x',<1>,1:0]
[#1,2:2='=',<3>,1:2]
[#2,4:4='1',<1>,1:4]
[#3,7:7='y',<1>,2:0]
[#4,9:9='=',<3>,2:2]
[#5,11:12='10',<1>,2:4]
[#6,14:13='<EOF>',<-1>,3:0]
line 1:4 missing INT at '1'
line 2:0 extraneous input 'y' expecting '='
line 2:4 missing INT at '10'
line 3:0 mismatched input '<EOF>' expecting '='
(program (assignstatement x = <missing INT>) (assignstatement 1 y = <missing INT>) (assignstatement 10))
Can someone figure out what's causing these errors?
The lexer will never create INT tokens because your ID rule also matches tokens consisting of only digits and, being defined first, wins the tie.
Don't let your ID rule start with a digit, and you're fine:
ID : ('_'|'a'..'z'|'A'..'Z') ('_'|'a'..'z'|'A'..'Z'|DIGIT)*;
Or the equivalent:
ID : [_a-zA-Z] [_a-zA-Z0-9]*;
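As a quick check of this, the Python sketch below emulates the lexer with regular expressions (it is not ANTLR itself). The WS rule is an assumption, since the posted grammar does not show one, and token_types is a made-up helper name.

```python
import re

OLD_ID = r"[_a-zA-Z0-9][_a-zA-Z0-9]*"   # ID as in the question
NEW_ID = r"[_a-zA-Z][_a-zA-Z0-9]*"      # ID that cannot start with a digit

def token_types(id_pattern, text):
    # Grammar order: ID before INT. WS is assumed, it is not in the posted grammar.
    rules = [("ID", id_pattern), ("INT", r"[0-9]+"), ("EQ", r"="), ("WS", r"\s+")]
    pos, out = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no lexer rule matches at offset {pos}")
        if best_name != "WS":
            out.append((best_name, text[pos:best_end]))
        pos = best_end
    return out

source = "x = 1\ny = 10"
print(token_types(OLD_ID, source))
print(token_types(NEW_ID, source))
```

With OLD_ID, 1 and 10 come out as ID, which matches the type <1> shown for them in the token dump. With NEW_ID they come out as INT.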
I'm trying to develop a grammar to parse a DSL using ANTLR4 (first attempt at using it)
The grammar itself is somewhat similar to SQL, in the sense that it should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
From browsing the internet, it seems to me that the main reason is the lexer selecting the longest matching sequence, but even after several attempts at rewriting the lexer and parser rules I could not arrive at a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? Should I use lexer modes?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands.
What puzzles me a bit is that the individual parts work properly on their own,
e.g. parsing only the 'expr' rule, as shown by the tree below.
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. Same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : DIGIT+; and WORD : CHAR (CHAR | NUMBER)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where it is most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.
my real grammar is way more complex but I could strip down my problem. So this is the grammar:
grammar test2;
options {language=CSharp3;}
@parser::namespace { Test.Parser }
@lexer::namespace { Test.Parser }
start : 'VERSION' INT INT project;
project : START 'project' NAME TEXT END 'project';
START: '/begin';
END: '/end';
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
INT : '0'..'9'+;
NAME: ('a'..'z' | 'A'..'Z')+;
TEXT : '"' ( '\\' (.) |'"''"' |~( '\\' | '"' | '\n' | '\r' ) )* '"';
STARTA
: '/begin hello';
And I want to parse this (for example):
VERSION 1 1
/begin project
testproject "description goes here"
/end
project
Now it will not work like this (Mismatched token exception). If I remove the last Token STARTA, it works. But why? I don't get it.
Help is really appreciated.
Thanks.
When the lexer sees the input "/begin " (including the space!), it is committed to the rule STARTA. When it can't match said rule, because the next char in the input is a "p" (from "project") and not a "h" (from "hello"), it will try to match another rule that can match "/begin " (including the space!). But there is no such rule, producing the error:
mismatched character 'p' expecting 'h'
and the lexer will not give up the space and match the START rule.
Remember that last part: once the lexer has matched something, it will not give up on it. It might try other rules that match the same input, but it will not backtrack to match a rule that matches fewer characters!
This is simply how the lexer works in ANTLR 3.x, no way around it.
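The committing behaviour can be illustrated with a small Python sketch. This is a deliberately simplified model for literal-only rules, not how ANTLR 3 is actually implemented: it reads input only as long as more than one rule is still possible, then commits to the survivor. The names committed_match and LexError are invented for this illustration.

```python
class LexError(Exception):
    pass

def committed_match(rules, text):
    """Choose one literal rule by lookahead, then require a full match."""
    alive = list(rules)              # (name, literal) pairs still possible
    i = 0
    while len(alive) > 1:
        can_continue = [(n, s) for n, s in alive
                        if i < len(s) and i < len(text) and s[i] == text[i]]
        if not can_continue:
            break
        # Follow the input: rules that end here are dropped for good.
        alive = can_continue
        i += 1
    if len(alive) > 1:
        # Nobody can consume the next character: keep rules that are complete.
        alive = [(n, s) for n, s in alive if len(s) == i]
        if not alive:
            raise LexError("no viable alternative")
    name, literal = alive[0]
    # Committed: the rest of the literal must match, there is no fallback.
    for j in range(i, len(literal)):
        got = text[j] if j < len(text) else "<EOF>"
        if got != literal[j]:
            raise LexError(f"mismatched character {got!r} expecting {literal[j]!r}")
    return name, literal

RULES = [("START", "/begin"), ("STARTA", "/begin hello")]

try:
    committed_match(RULES, "/begin project")
    error = None
except LexError as exc:
    error = str(exc)
print(error)  # mismatched character 'p' expecting 'h'
print(committed_match([("START", "/begin")], "/begin project"))  # ('START', '/begin')
```

After the space following /begin, only STARTA can continue, so the sketch commits to it and fails on the p. With STARTA removed there is nothing to commit to beyond START, which is why the grammar works without that rule.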