What's wrong with this ANTLR grammar? - antlr

I want to parse query expressions that look like this:
Person Name=%John%
(Person Name=John% and Address=%Ontario%)
Person Fullname_3="John C. Smith"
But I'm totally new to Antlr4 and can't even figure out how to parse one single TABLE FIELD=QUERY clause. When I run the grammar below in Go as target, I get
line 1:7 mismatched input 'Name' expecting {'not', '(', FIELDNAME}
for a simple query like
Person Name=John
Why can't the Grammar parse FIELDNAME via parsing fieldsearch->field EQ searchterm->FIELDNAME?
I guess I'm misunderstanding something very fundamental here about how Antlr Grammars work, but what?
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start : searchclause EOF ;
searchclause
: table expr
;
expr
: fieldsearch
| unop fieldsearch
| LPAREN expr relop expr RPAREN
;
unop
: NOT
;
relop
: AND
| OR
;
fieldsearch
: field EQ searchterm
;
field
: FIELDNAME
;
table
: TABLENAME
;
searchterm
: STRING
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
EQ
: '='
;
LPAREN
: '('
;
RPAREN
: ')'
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
TABLENAME
: VALID_ID_START VALID_ID_CHAR*
;
FIELDNAME
: VALID_ID_START VALID_ID_CHAR*
;
STRING: '"' ~('\n'|'"')* ('"' | { panic("syntax-error - unterminated string literal") } ) ;
WS
: [ \r\n\t] + -> skip
;

Try looking at the tokens produced for that input using grun Mdb tokens -tokens. It will tell you that the input consists of two table names, an equals sign and then another table name. To match your grammar it would have needed to be a table name, a field name, an equals sign and a string.
The first problem is that TABLENAME and FIELDNAME have the exact same definition. In cases where two lexer rules would produce a match of the same length on the current input, ANTLR prefers the one that comes first in the grammar. So it will never produce a FIELDNAME token. To fix that just replace both of those rules with a single ID rule. If you want to, you can then introduce parser rules tableName : ID ; and fieldName : ID ; if you want to keep the names.
The other problem is more straight forward: John simply does not match your rules for a string since it's not in quotes. If you do want to allow John as a valid search term, you might want to define it as searchterm : STRING | ID ; instead of only allowing STRINGs.

Related

Grammar for ANLTR 4

I'm trying to develop a grammar to parse a DSL using ANTLR4 (first attempt at using it)
The grammar itself is somewhat similar to SQL in the sense that should
It should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
By browsing on internet it seems to me that the main reason is down to the lexer selecting the longest matching sequence, but even after several attempts of rewriting lexer and grammar rules I could not achieve a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? should I use lexer modes ?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands
What is a bit puzzling me is that the single parts are instead working properly,
e.g. parsing only the 'expr' as shown by the tree below
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. Same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : CHAR+;, and WORD: CHAR (CHAR | NUMBER)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where is it most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.

What is wrong with this ANTLR Grammar? Conditional statement nested parenthesis

I've been tasked with writing a prototype of my team's DSL in Java, so I thought I would try it out using ANTLR. However I'm having problems with the 'expression' and 'condition' rules.
The DSL is already well defined so I would like to keep as close to the current spec as possible.
grammar MyDSL;
// Obviously this is just a snippet of the whole language, but it should give a
// decent view of the issue.
entry
: condition EOF
;
condition
: LPAREN condition RPAREN
| atomic_condition
| NOT condition
| condition AND condition
| condition OR condition
;
atomic_condition
: expression compare_operator expression
| expression (IS NULL | IS NOT NULL)
| identifier
| BOOLEAN
;
compare_operator
: EQUALS
| NEQUALS
| GT | LT
| GTEQUALS | LTEQUALS
;
expression
: LPAREN expression RPAREN
| atomic_expression
| PREFIX expression
| expression (MULTIPLY | DIVIDE) expression
| expression (ADD | SUBTRACT) expression
| expression CONCATENATE expression
;
atomic_expression
: SUBSTR LPAREN expression COMMA expression (COMMA expression)? RPAREN
| identifier
| INTEGER
;
identifier
: WORD
;
// Function Names
SUBSTR: 'SUBSTR';
// Control Chars
LPAREN : '(';
RPAREN : ')';
COMMA : ',';
// Literals and Identifiers
fragment DIGIT : [0-9] ;
INTEGER: DIGIT+;
fragment LETTER : [A-Za-z#$#];
fragment CHARACTER : DIGIT | LETTER | '_';
WORD: LETTER CHARACTER*;
BOOLEAN: 'TRUE' | 'FALSE';
// Arithmetic Operators
MULTIPLY : '*';
DIVIDE : '/';
ADD : '+';
SUBTRACT : '-';
PREFIX: ADD| SUBTRACT ;
// String Operators
CONCATENATE : '||';
// Comparison Operators
EQUALS : '==';
NEQUALS : '<>';
GTEQUALS : '>=';
LTEQUALS : '<=';
GT : '>';
LT : '<';
// Logical Operators
NOT : 'NOT';
AND : 'AND';
OR : 'OR';
// Keywords
IS : 'IS';
NULL: 'NULL';
// Whitespace
BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;
The phrase I'm testing with is
(FOO == 115 AND (SUBSTR(BAR,2,1) == 1 OR SUBSTR(BAR,4,1) == 1))
However it is breaking on the nested parenthesis, matching the first ( with the first ) instead of the outermost (see below). In ANTLR3 I solved this with semantic predicates but it seems that ANTLR4 is supposed to have fixed the need for those.
I'd really like to keep the condition and the expression rules separate if at all possible. I have been able to get it to work when merged together in a single expression rule (based on examples here and elsewhere) but the current DSL spec has them as different and I'm trying to reduce any possible differences in behaviour.
Can anyone point out how I can get this all working while maintaining a separate rule for conditions' andexpressions`? Many thanks!
The grammar seems fine to me.
There's one thing going wrong in the lexer: the WORD token is defined before various keywords/operators causing it to get precedence over them. Place your WORD rule at the very end of your lexer rules (or at least after the last keywords which WORD could also match).

ANTLR3 grammar works with spaces but gives NoViableAltException when I omit spaces

I need to parse user input that defines queries to a system. The heart of such queries are triplets which can also be combined to form complex queries (the idea is to restrict a result set to only show entries which satisfy these queries). Here are 3 sample inputs:
field1 = simpleValueNoQuotes
field2 ~ "valueWithQuotes"
(field1 = simpleValueNoQuotes OR field2 ~ "valueWithQuotes") AND field3 = foobar
The user must use quoted values if their values contain any reserved characters like doublequotes or parentheses as well as whitespace.
So far, my grammar has handled this well enough, but now a new requirement has come up. Users should be allowed to omit the spaces, entering queries like field1=simpleValueNoQuotes. My grammar can't handle this and I can't seem to figure out why (this is my first project with antlr).
Here is my grammar in a slightly simplified form:
grammar simple;
querytree : query EOF;
query : subquery (operator subquery)* ;
subquery : leaf | composite;
operator : 'and' | 'or';
leaf : fieldname comparison value;
value : DOUBLEQUOTE_DELIMITED_VALUE | SIMPLE_VALUE;
composite : leftParenthesis query rightParenthesis;
fieldname : 'field1' | 'field2'; //this has many keywords in reality
comparison : '=' | '~';
leftParenthesis : '(';
rightParenthesis : ')';
fragment
ESCAPE : '\\' ( '"' | '\\') ;
DOUBLEQUOTE_DELIMITED_VALUE
: '"' ( ~( '"' | '\\' ) | ESCAPE )* '"'
;
SIMPLE_VALUE
: ('\u0021'|'\u0023'..'\u0027'|'\u002A'..'\u007E'|'\u00A1'..'\uFFFF')*; /*all unicode characters except control characters, doublequotes, parentheses and whitespace defined below*/
WHITESPACE
: ('\u0009'|'\u000A'|'\u000C'|'\u000D'|'\u0020'|'\u00A0')+ {$channel = HIDDEN;} /*\t, \n, \f, \r, space, nonbreaking space*/
;
Any ideas as to why this is able to parse field1 = simpleValueNoQuotes but unable to parse field1=simpleValueNoQuotes?
You forgot to exclude = from SIMPLE_VALUE, which means field1=simpleValueNoQuotes is a single SIMPLE_VALUE token.

Antlr tells me NoViableAltException,i'm clueless

I'm currently working on a ANTLR grammar that accepts sql statements. I want to use this grammar to allow programmers to create mysql queries and the application will automatically change the query into the proper format for the needed database.
For example if you use LIMIT 0,5 in your query it will automatically transform this query to the proper format for mssql
This is my grammer up until now
grammar sql2;
query
: select ';'? EOF
;
select
: 'SELECT' top? select_exp 'FROM' table ('WHERE' compare_exp)? limit?
;
compare_exp
: field CompareOperator param (BooleanOperator compare_exp)?
;
param
: '#' ID
;
top
: 'TOP' INT
;
limit
: 'LIMIT' INT (',' INT)?
;
select_exp
: field (',' select_exp)?
;
table
: '`' ID '`' ('.`' ID '`')?
| ID ('.' ID)?
;
field
: table
;
ID : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
CompareOperator
: ( '=' | '<>' )
;
BooleanOperator
: ('AND' | 'OR')
;
INT : '0'..'9'+
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
If I test this with the input
SELECT TOP 5 bla_asdf, bal, lab FROM `x`.`y` WHERE xdf = #tf AND bla = #b LIMIT 0, 5
it stops parsing my query at the AND bla = #b, at that point it gives me an NoViableAltException...
If I put the input
SELECT TOP 5 bla_asdf, bal, lab FROM `x`.`y` WHERE xdf = #tf LIMIT 0, 5
It will give me no problems whatsoever.
I'm absolutely no expert at ANTLR but that also means that I don't see what I'm doing wrong here.
Can anybody help me on this?
Cheers
AND is tokenized as an ID, because the first lexer rule matching the longest input fragment wins.
So you should define BooleanOperator before ID.

antlr lexer rule matching a prefix of another rule

I am not sure that the issue is actually the prefixes, but here goes.
I have these two rules in my grammar (among many others)
DOT_T : '.' ;
AND_T : '.AND.' | '.and.' ;
and I need to parse strings like this:
a.eq.b.and.c.ne.d
c.append(b)
this should get lexed as:
ID[a] EQ_T ID[b] AND_T ID[c] NE_T ID[d]
ID[c] DOT_T ID[append] LPAREN_T ID[b] RPAREN_T
the error I get for the second line is:
line 1:3 mismatched character "p"; expecting "n"
It doesn't lex the . as a DOT_T but instead tries to match .and. because it sees the a after ..
Any idea on what I need to do to make this work?
UPDATE
I added the following rule and thought I'd use the same trick
NUMBER_T
: DIGIT+
( (DECIMAL)=> DECIMAL
| (KIND)=> KIND
)?
;
fragment DECIMAL
: '.' DIGIT+ ;
fragment KIND
: '.' DIGIT+ '_' (ALPHA+ | DIGIT+) ;
but when I try parsing this:
lda.eq.3.and.dim.eq.3
it gives me the following error:
line 1:9 no viable alternative at character "a"
while lexing the 3. So I'm guessing the same thing is happening as above, but the solution doesn't work in this case :S Now I'm properly confused...
Yes, that is because of the prefixed '.'-s.
Whenever the lexer stumbles upon ".a", it tries to create a AND_T token. If the characters "nd" can then not be found, the lexer tries to construct another token that starts with a ".a", which isn't present (and ANTLR produces an error). So, the lexer will not give back the character "a" and fall back to create a DOT_T token (and then an ID token)! This is how ANTLR works.
What you can do is optionally match these AND_T, EQ_T, ... inside the DOT_T rule. But still, you will need to "help" the lexer a bit by adding some syntactic predicates that force the lexer to look ahead in the character stream to be sure it can match these tokens.
A demo:
grammar T;
parse
: (t=. {System.out.printf("\%-10s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
DOT_T
: '.' ( (AND_T)=> AND_T {$type=AND_T;}
| (EQ_T)=> EQ_T {$type=EQ_T; }
| (NE_T)=> NE_T {$type=NE_T; }
)?
;
ID
: ('a'..'z' | 'A'..'Z')+
;
LPAREN_T
: '('
;
RPAREN_T
: ')'
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ {skip();}
;
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL)?
;
fragment DECIMAL : '.' DIGIT+ ;
fragment AND_T : ('AND' | 'and') '.' ;
fragment EQ_T : ('EQ' | 'eq' ) '.' ;
fragment NE_T : ('NE' | 'ne' ) '.' ;
fragment DIGIT : '0'..'9';
And if you feed the generated parser the input:
a.eq.b.and.c.ne.d
c.append(b)
the following output will be printed:
ID 'a'
EQ_T '.eq.'
ID 'b'
AND_T '.and.'
ID 'c'
NE_T '.ne.'
ID 'd'
ID 'c'
DOT_T '.'
ID 'append'
LPAREN_T '('
ID 'b'
RPAREN_T ')'
And for the input:
lda.eq.3.and.dim.eq.3
the following is printed:
ID 'lda'
EQ_T '.eq.'
NUMBER_T '3'
AND_T '.and.'
ID 'dim'
EQ_T '.eq.'
NUMBER_T '3'
EDIT
The fact that DECIMAL and KIND both start with '.' DIGIT+ is not good. Try something like this:
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL ((KIND)=> KIND)?)?
;
fragment DECIMAL : '.' DIGIT+;
fragment KIND : '_' (ALPHA+ | DIGIT+); // removed ('.' DIGIT+) from this fragment
Note that the rule NUMBER_T will now never produce DECIMAL or KIND tokens. If you want that to happen, you need to change the type:
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL {/*change type*/} ((KIND)=> KIND {/*change type*/})?)?
;