Only char 'a' cannot be recognized in ANTLR grammar

My rule for ID:
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
When I parse input with my rules, only the single character 'a' cannot be recognised, but 'A', 'aa', 'a0', 'b', 'c', 'AAAZzzzxx' and everything else is recognised by the lexer. Why not 'a'?
Error:
mismatched input 'a' expecting 'u0005'
thanks!

Your rule can match ZERO characters and so the lexer will go haywire. You need:
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+
;
See the '+' instead of '*'?
Jim
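A quick way to sanity-check what each variant of the rule accepts is to translate both into regular expressions. The sketch below is Python rather than ANTLR, and the names ID_STAR, ID_PLUS and accepts are made up for illustration. In this translation the '*' form already requires one leading character, so it never matches the empty string, while the '+' form needs at least two characters and rejects a lone 'a'. If the same holds for the real grammar, it may be worth checking whether 'a' is used as a literal in some parser rule, since such a literal gets its own implicit token; that is only a guess, as the rest of the grammar isn't shown.

```python
import re

# Regex translations of the two ID variants. Assumption: ANTLR ranges such
# as 'a'..'z' map directly onto regex character classes.
ID_STAR = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")  # the rule from the question
ID_PLUS = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]+")  # the rule with '+'


def accepts(pattern, text):
    """Return True when the pattern matches the whole text."""
    return pattern.fullmatch(text) is not None


# Print, for each sample, whether the '*' and the '+' variant accept it.
for sample in ["a", "A", "aa", "a0", ""]:
    print(repr(sample), accepts(ID_STAR, sample), accepts(ID_PLUS, sample))
```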

Related

Why doesn't this ANTLR grammar derive the string `baba`?

Using ANTLR v4.9.3, I created the following grammar …
grammar G ;
start : s EOF ;
s : 'ba' a b ;
a : 'b' ;
b : 'a' ;
Given the above grammar, I thought that the following derivation is possible …
start → s → 'ba' a b → 'ba' 'b' b → 'ba' 'b' 'a' = 'baba'
However, my Java test program indicates a syntax error occurs when trying to parse the string baba.
Shouldn't the string baba be in the language generated by grammar G ?
Although the conclusion/answer is already in the comments, here is an answer that explains it in a bit more detail.
When you define literal tokens inside parser rules (the 'ba', 'a' and 'b'), ANTLR implicitly creates the following grammar:
grammar G ;
start : s EOF ;
s : T__0 a b ;
a : T__1 ;
b : T__2 ;
T__0 : 'ba';
T__1 : 'b';
T__2 : 'a';
Now, when the lexer gets the input "baba", it will create two T__0 tokens. The lexer is not driven by whatever the parser is trying to match; it works independently of the parser. The lexer creates tokens following these two rules:
1. try to match as many characters as possible for a rule
2. when two (or more) lexer rules match the same characters, let the one defined first "win"
Because of rule 1, the input is split into two T__0 tokens ('ba' and 'ba'), so the parser never sees the separate 'b' and 'a' tokens that the rules a and b expect.
As you already mentioned in a comment, removing the 'ba' token (and using 'b' followed by 'a') would resolve the issue:
grammar G ;
start : s EOF ;
s : 'b' 'a' a b ;
a : 'b' ;
b : 'a' ;
which would really be the grammar:
grammar G ;
start : s EOF ;
s : T__0 T__1 a b ;
a : T__0 ;
b : T__1 ;
T__0 : 'b';
T__1 : 'a';
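The two lexer rules described in this answer can be sketched in a few lines of Python. This is only an illustration of longest-match and first-rule-wins over literal tokens, not how ANTLR is implemented; the function name tokenize and the list names original and fixed are invented for the example.

```python
def tokenize(rules, text):
    """Tokenize text with (name, literal) rules listed in grammar order.

    At each position the longest matching literal wins; on a tie the rule
    listed first wins. The parser is never consulted.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, literal in rules:
            if text.startswith(literal, pos):
                # Strictly longer only, so earlier rules win ties.
                if best is None or len(literal) > len(best[1]):
                    best = (name, literal)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


# Implicit token rules of the first grammar and of the grammar without 'ba'.
original = [("T__0", "ba"), ("T__1", "b"), ("T__2", "a")]
fixed = [("T__0", "b"), ("T__1", "a")]

print(tokenize(original, "baba"))
print(tokenize(fixed, "baba"))
```

With the first rule set, "baba" comes out as two 'ba' tokens; with the second, it comes out as b, a, b, a, which is the sequence the parser rule s expects.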

What's wrong with this ANTLR grammar?

I want to parse query expressions that look like this:
Person Name=%John%
(Person Name=John% and Address=%Ontario%)
Person Fullname_3="John C. Smith"
But I'm totally new to Antlr4 and can't even figure out how to parse one single TABLE FIELD=QUERY clause. When I run the grammar below in Go as target, I get
line 1:7 mismatched input 'Name' expecting {'not', '(', FIELDNAME}
for a simple query like
Person Name=John
Why can't the Grammar parse FIELDNAME via parsing fieldsearch->field EQ searchterm->FIELDNAME?
I guess I'm misunderstanding something very fundamental here about how Antlr Grammars work, but what?
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start : searchclause EOF ;
searchclause
: table expr
;
expr
: fieldsearch
| unop fieldsearch
| LPAREN expr relop expr RPAREN
;
unop
: NOT
;
relop
: AND
| OR
;
fieldsearch
: field EQ searchterm
;
field
: FIELDNAME
;
table
: TABLENAME
;
searchterm
: STRING
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
EQ
: '='
;
LPAREN
: '('
;
RPAREN
: ')'
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
TABLENAME
: VALID_ID_START VALID_ID_CHAR*
;
FIELDNAME
: VALID_ID_START VALID_ID_CHAR*
;
STRING: '"' ~('\n'|'"')* ('"' | { panic("syntax-error - unterminated string literal") } ) ;
WS
: [ \r\n\t] + -> skip
;
Try looking at the tokens produced for that input using grun Mdb tokens -tokens. It will tell you that the input consists of two table names, an equals sign and then another table name. To match your grammar it would have needed to be a table name, a field name, an equals sign and a string.
The first problem is that TABLENAME and FIELDNAME have the exact same definition. When two lexer rules match the same number of characters at the current position, ANTLR prefers the one that comes first in the grammar, so it will never produce a FIELDNAME token. To fix that, replace both rules with a single ID rule. If you want to keep the names, you can then introduce parser rules tableName : ID ; and fieldName : ID ;.
The other problem is more straightforward: John simply does not match your rule for a string, since it's not in quotes. If you do want to allow John as a valid search term, define it as searchterm : STRING | ID ; instead of only allowing STRING.
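If grun is not at hand, the tie-break between identically defined rules can be imitated with a small Python sketch. It is a rough model under stated assumptions: only the EQ, identifier and WS rules are included, the ANTLR rules are translated to regular expressions, and the names tokenize, original and single_id are hypothetical.

```python
import re

# Regex translation of VALID_ID_START VALID_ID_CHAR* (an assumption).
ID = r"[A-Za-z_][A-Za-z0-9_]*"


def tokenize(rules, text):
    """Return token names: longest match wins, earlier rule wins a tie."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer only, so a later rule never beats an earlier tie.
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mirrors '-> skip'
            tokens.append(best_name)
        pos += len(best_text)
    return tokens


original = [("EQ", "="), ("TABLENAME", ID), ("FIELDNAME", ID), ("WS", r"[ \r\n\t]+")]
single_id = [("EQ", "="), ("ID", ID), ("WS", r"[ \r\n\t]+")]

print(tokenize(original, "Person Name=John"))
print(tokenize(single_id, "Person Name=John"))
```

In the first run FIELDNAME never appears, because TABLENAME is listed first and always ties with it. In the second run every name is an ID, and it is left to the parser rules to decide which ID is a table, a field or a search term.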

ANTLR mixup between a token 'B' and the hexnumber 'B' without prefix

I have difficulties finding a proper grammar for a language with the following structure:
W1234
B1234[16]
B6789
W6789,B
where 1234 and 6789 stand for any decimal number, and the B after the comma stands for any hex number without a prefix.
The grammar:
grammar Tsx7;
program returns [string code]
: ident EOF
;
ident returns [string code]
: 'W' NUMBER ( ',' ( DEC_DIGITS | HEX_DIGIT )+ )?
| 'B' NUMBER ( '[' NUMBER ']' )?
;
NUMBER: DEC_DIGITS+;
fragment DEC_DIGITS: '0'..'9';
fragment HEX_DIGITS: 'A'..'F';
With these (legal) inputs
W1,B
W34,1
I get InputMismatch or NoViableAlt exception.
whereas these inputs work without problems:
W1
B2[8]
B3123
I tried several grammars but couldn't find the correct one.
Thanks for any help,
Martin
Don't use lexer fragments in parser rules. They will never show up in the parser. Make those rules normal lexer rules or define one which calls the fragment rules.
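To see why the parser never receives DEC_DIGITS or HEX_DIGITS, it can help to list the token types the lexer is able to emit. The Python sketch below is a simplified model, assuming NUMBER is the only named token rule and that the literals 'W', 'B', ',', '[' and ']' become implicit tokens; TOKEN_RE and token_types are names made up for the example.

```python
import re

# Only rules that can emit tokens: NUMBER plus the literals used in the
# parser rules. DEC_DIGITS and HEX_DIGITS are fragments, so they are left out.
TOKEN_RE = re.compile(r"(?P<NUMBER>[0-9]+)|(?P<LITERAL>[WB,\[\]])")


def token_types(text):
    """List the token types a lexer with the rules above would emit."""
    types = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"no token rule matches at position {pos}")
        # Literal tokens are reported by their own quoted text, e.g. 'B'.
        types.append("NUMBER" if m.lastgroup == "NUMBER" else repr(m.group()))
        pos = m.end()
    return types


print(token_types("W1,B"))
print(token_types("W34,1"))
```

In this model the text after the comma arrives as the literal token 'B' or as NUMBER, never as a fragment, so an alternative written in terms of fragments has nothing to match.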

Antlr Sample Grammar Error

I am new to ANTLR. I defined the following test grammar, which is intended to parse a series of assignment statements like the following:
x=1
y=10
=======================================================================
grammar test;
program
:
assignstatement*
;
assignstatement
:
ID '=' INT
;
ID : ('_'|'a'..'z'|'A'..'Z'|DIGIT) ('_'|'a'..'z'|'A'..'Z'|DIGIT)*;
INT: DIGIT+;
fragment DIGIT : [0-9] ; // not a token by itself
I got the following error when running the testRig
[#0,0:0='x',<1>,1:0]
[#1,2:2='=',<3>,1:2]
[#2,4:4='1',<1>,1:4]
[#3,7:7='y',<1>,2:0]
[#4,9:9='=',<3>,2:2]
[#5,11:12='10',<1>,2:4]
[#6,14:13='<EOF>',<-1>,3:0]
line 1:4 missing INT at '1'
line 2:0 extraneous input 'y' expecting '='
line 2:4 missing INT at '10'
line 3:0 mismatched input '<EOF>' expecting '='
(program (assignstatement x = <missing INT>) (assignstatement 1 y = <missing INT>) (assignstatement 10))
Can someone figure out what's causing these errors?
The lexer will never create INT tokens because your ID rule also matches input consisting only of digits, and ID is defined before INT, so it wins whenever both rules match the same text.
Change your ID rule so that it cannot start with a digit, and you're fine:
ID : ('_'|'a'..'z'|'A'..'Z') ('_'|'a'..'z'|'A'..'Z'|DIGIT)*;
Or the equivalent:
ID : [_a-zA-Z] [_a-zA-Z0-9]*;
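The effect of the change can be checked with a small Python sketch. It assumes the lexeme passed in is already the longest match, so only rule order decides; token_type, original and fixed are hypothetical names, and the ANTLR rules are translated to regular expressions.

```python
import re


def token_type(rules, lexeme):
    """Return the first rule, in grammar order, that matches the whole lexeme."""
    for name, pattern in rules:
        if re.fullmatch(pattern, lexeme):
            return name
    return None


# ID may start with a digit (the grammar from the question).
original = [("ID", r"[_a-zA-Z0-9][_a-zA-Z0-9]*"), ("INT", r"[0-9]+")]
# ID may not start with a digit (the corrected rule).
fixed = [("ID", r"[_a-zA-Z][_a-zA-Z0-9]*"), ("INT", r"[0-9]+")]

for lexeme in ["x", "1", "10"]:
    print(lexeme, token_type(original, lexeme), token_type(fixed, lexeme))
```

With the original rules every lexeme is reported as ID, which matches the <1> token type on all three values in the token dump above. With the corrected rule, 1 and 10 come out as INT.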

Grammar for ANTLR 4

I'm trying to develop a grammar to parse a DSL using ANTLR4 (first attempt at using it)
The language itself is somewhat similar to SQL.
It should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
From browsing the internet, it seems to me that the main cause is the lexer selecting the longest matching sequence, but even after several attempts at rewriting the lexer and parser rules I could not arrive at a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? Should I use lexer modes?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands
What puzzles me a bit is that the individual parts work properly on their own,
e.g. parsing only the 'expr' as shown by the tree below
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. The same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : DIGIT+; and WORD : CHAR (CHAR | NUMBER)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where it is most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.
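The difference between cutting at word boundaries and cutting at every character can be sketched in Python. This is an illustration only: the regular expressions are rough translations of the WORD, NUMBER and DOT rules, and lex_words and lex_chars are names invented for the example.

```python
import re

WORD = r"[A-Za-z][A-Za-z0-9]*"  # WORD : CHAR (CHAR | NUMBER)*
NUMBER = r"[0-9]+"              # NUMBER : DIGIT+
DOT = r"\."


def lex_words(text):
    """Cut the text into whole WORD, NUMBER and DOT lexemes."""
    return re.findall(f"{WORD}|{NUMBER}|{DOT}", text)


def lex_chars(text):
    """Cut the text into single characters, as a CHAR-per-token lexer would."""
    return [c for c in text if not c.isspace()]


print(lex_words("deco.marginType"))
print(len(lex_chars("deco.marginType")))
```

The word-level cut hands the parser three tokens for deco.marginType, while the character-level cut hands it fifteen, each of which a parser rule would have to reassemble.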