Lexer negation in ANTLR - why doesn't double-negation work? - antlr

I have this token I would like to match any string of non-special characters except for the literal string "TODAY".
ANTLR makes this a pain in the butt:
UNQUOTED :
( ~('T'|'t'|~UnquotedStartChar) UnquotedChar*
( ~('O'|'o'|~UnquotedChar) UnquotedChar*
| ('O'|'o')
( ~('D'|'d'|~UnquotedChar) UnquotedChar*
| ('D'|'d')
( ~('A'|'a'|~UnquotedChar) UnquotedChar*
| ('A'|'a')
( ~('Y'|'y'|~UnquotedChar) UnquotedChar*
)?
)?
)?
)?
)
;
I figured I was being clever by double-negating in the ~('T'|'t'|~UnquotedStartChar) - it should match everything that isn't 'T', 't', or non-permitted in the start char normally... which intuitively seems like it should work, and when I think about it a bit, seems like it should work, but when I actually try to compile it, I get this error message:
error(100): antlr/QueryParser.g:0:1: syntax error: buildnfa: MismatchedTreeNodeException(16!=32)
UnquotedStartChar is, in itself, quite a mess... but I'm not entirely sure it's relevant yet. Here's the first couple of levels anyway.
fragment
UnquotedStartChar
: EscapeSequence
| ~( ProhibitedAtStartOfUnquoted )
;
fragment
ProhibitedAtStartOfUnquoted
: ProhibitedInUnquoted
| Slash | PLUS | MINUS | DOLLAR;
fragment
EscapeSequence
: Backslash
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~( 'u' )
)
;

In lexer grammar, rule (token) that is defined earlier takes precedence in case of a conflict, so you could just define a token for 'TODAY' above UNQUOTED and it should match the string instead of the general UNQUOTED token.

Related

Can't create a variable with just one letter

I wish the variables could be declared with only one letter in the name.
When I write Integer aa; all work, but
when I type Integer a; then grun says: mismatched input 'a' expecting ID.
I've seen the inverse problem but it didn't help. I think my code is right but I can't see where I'm wrong. This is my lexer:
lexer grammar Symbols;
...
LineComment: '//' ~[\u000A\u000D]* -> channel(HIDDEN) ;
DelimetedComment: '/*' .*? '*/' -> channel(HIDDEN) ;
String: '"' .*? '"' ;
Character: '\'' (EscapeSeq | .) '\'' ;
IntegerLiteral: '0' | (ADD?| SUB) DecDigitNoZero DecDigit+ ;
FloatLiteral: ((ADD? | SUB) (DecDigitNoZero DecDigit*)? DOT DecDigit+ | IntegerLiteral) [F] ;
DoubleLiteral: ((ADD? | SUB) (DecDigitNoZero DecDigit*)? DOT DecDigit+ | IntegerLiteral) [D] ;
LongLiteral: IntegerLiteral [L] ;
HexLiteral: '0' [xX] HexDigit (HexDigit | UNDERSCORE)* ;
BinLiteral: '0' [bB] BinDigit (BinDigit | UNDERSCORE)* ;
OctLiteral: '0' [cC] OctDigit (OctDigit | UNDERSCORE)* ;
Booleans: TRUE | FALSE ;
Number: IntegerLiteral | FloatLiteral | DoubleLiteral | BinLiteral | HexLiteral | OctLiteral | LongLiteral ;
EscapeSeq: UniCharacterLiteral | EscapedIdentifier;
UniCharacterLiteral: '\\' 'u' HexDigit HexDigit HexDigit HexDigit ;
EscapedIdentifier: '\\' ('t' | 'b' | 'r' | 'n' | '\'' | '"' | '\\' | '$') ;
HexDigit: [0-9a-fA-F] ;
BinDigit: [01] ;
OctDigit: [0-7];
DecDigit: [0-9];
DecDigitNoZero: [1-9];
ID: [a-z] ([a-zA-Z_] | [0-9])*;
TYPE: [A-Z] ([a-zA-Z] | UNDERSCORE | [0-9])* ;
DATATYPE: Number | String | Character | Booleans ;
When you get an error like "Unexpected input 'foo', expected BAR" and you think "But 'foo' is a BAR", the first thing you should do is to print the token stream for your input (you can do this by running grun Symbols tokens -tokens inputfile). If you do this, you'll see that the a in your input is recognized as a HexDigit, not as an ID.
Why does this happen? Because both HexDigit and ID match the input a and ANTLR (like most lexer generators) resolves ambiguities according to the maximal munch rule: When multiple rules can match the current input, it chooses the one that produces the longest match (which is why variables with more than one letter work) and then resolves ties by picking the one that is defined first, which is HexDigit in this case.
Note that the lexer does not care which lexer rules are used by the parser and when. The lexer decides which tokens to produce solely based on the contents of the lexer grammar, so the lexer does not know or care that the parser wants an ID right now. It looks at all rules that match and then picks one according to the maximal munch rule and that's it.
In your case you never actually use HexDigit in your parser grammar, so there is no reason why you'd ever want a HexDigit token to be created. Therefore HexDigit should not be a lexer rule - it should be a fragment:
fragment HexDigit : [0-9a-fA-F];
This also applies to your other rules that aren't used in the parser, including all the ...Digit rules.
PS: Your Number rule will never match because of these same rules. It should probably be a parser rule instead (or the other number rules should be fragments if you don't care which kind of number literal you have).

Exclude tokens from Identifier lexical rule

I have Identifier lexical rule:
Identifier
: ( 'a'..'z' | 'A'..'Z' | '_' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' )*
;
LogicalOr and LogicalAnd rules:
LogicalOr : '| ' | '||' | OR;
LogicalAnd : '&' | '&&' | AND;
fragment Or : '[Oo][Rr]';
fragment And : '[Aa][Nn][Dd]';
strings "and" and "or" are recognized as identifiers, instead of logicalAnd and logicalOr. Could someone help me to solve this problem please?
There are two potential issues at play. First and foremost, ANTLR 3 does not support the character class syntax introduced by ANTLR 4. Your Or fragment literally matches the input [Oo][Rr]; it does not match OR, or, or oR. The same applies to your And fragment. You need to write the rule like this instead:
fragment
Or
: ('O' | 'o') ('R' | 'r')
;
If this does not resolve your issue, then you need to make sure your LogicalOr and LogicalAnd rules are positioned before the Identifier rule in the grammar. The rule which appears first will determine what token type is assigned for this input sequence.

Grammar for ANLTR 4

I'm trying to develop a grammar to parse a DSL using ANTLR4 (first attempt at using it)
The grammar itself is somewhat similar to SQL in the sense that should
It should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
By browsing on internet it seems to me that the main reason is down to the lexer selecting the longest matching sequence, but even after several attempts of rewriting lexer and grammar rules I could not achieve a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? should I use lexer modes ?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands
What is a bit puzzling me is that the single parts are instead working properly,
e.g. parsing only the 'expr' as shown by the tree below
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. Same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : CHAR+;, and WORD: CHAR (CHAR | NUMBER)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where is it most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.

How to allow an identifer which can start with a digit without causing MismatchedTokenException

I want to match the following input:
statement span=1m 0_dur=12
with the following grammar:
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
statement :'statement' 'span' '=' INTEGER 'm' ident '=' INTEGER;
INTEGER
: DIGIT+
;
ident : IDENT | 'AVG' | 'COUNT';
IDENT
: (LETTER | DIGIT | '_')+ ;
WHITESPACE
: ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment
LETTER : ('a'..'z' | 'A'..'Z') ;
fragment
DIGIT : '0'..'9';
but it cause an error:
MismatchedTokenException : line 1:15 mismatched input '1m' expecting '\u0004'
Does anyone has any idea how to solve this?
THanks
Charles
I think your grammar is context sensitive, even at the lexical analyser(Tokenizer) level. The string "1m" is recognized as IDENT, not INTEGER followed by 'm'. You either redefine your syntax, or use predicated parsing, or embed Java code in your grammar to detect the context (e.g. If the number is presented after "span" followed by "=", then parse it as INTEGER).

antlr lexer rule matching a prefix of another rule

I am not sure that the issue is actually the prefixes, but here goes.
I have these two rules in my grammar (among many others)
DOT_T : '.' ;
AND_T : '.AND.' | '.and.' ;
and I need to parse strings like this:
a.eq.b.and.c.ne.d
c.append(b)
this should get lexed as:
ID[a] EQ_T ID[b] AND_T ID[c] NE_T ID[d]
ID[c] DOT_T ID[append] LPAREN_T ID[b] RPAREN_T
the error I get for the second line is:
line 1:3 mismatched character "p"; expecting "n"
It doesn't lex the . as a DOT_T but instead tries to match .and. because it sees the a after ..
Any idea on what I need to do to make this work?
UPDATE
I added the following rule and thought I'd use the same trick
NUMBER_T
: DIGIT+
( (DECIMAL)=> DECIMAL
| (KIND)=> KIND
)?
;
fragment DECIMAL
: '.' DIGIT+ ;
fragment KIND
: '.' DIGIT+ '_' (ALPHA+ | DIGIT+) ;
but when I try parsing this:
lda.eq.3.and.dim.eq.3
it gives me the following error:
line 1:9 no viable alternative at character "a"
while lexing the 3. So I'm guessing the same thing is happening as above, but the solution doesn't work in this case :S Now I'm properly confused...
Yes, that is because of the prefixed '.'-s.
Whenever the lexer stumbles upon ".a", it tries to create a AND_T token. If the characters "nd" can then not be found, the lexer tries to construct another token that starts with a ".a", which isn't present (and ANTLR produces an error). So, the lexer will not give back the character "a" and fall back to create a DOT_T token (and then an ID token)! This is how ANTLR works.
What you can do is optionally match these AND_T, EQ_T, ... inside the DOT_T rule. But still, you will need to "help" the lexer a bit by adding some syntactic predicates that force the lexer to look ahead in the character stream to be sure it can match these tokens.
A demo:
grammar T;
parse
: (t=. {System.out.printf("\%-10s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
DOT_T
: '.' ( (AND_T)=> AND_T {$type=AND_T;}
| (EQ_T)=> EQ_T {$type=EQ_T; }
| (NE_T)=> NE_T {$type=NE_T; }
)?
;
ID
: ('a'..'z' | 'A'..'Z')+
;
LPAREN_T
: '('
;
RPAREN_T
: ')'
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ {skip();}
;
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL)?
;
fragment DECIMAL : '.' DIGIT+ ;
fragment AND_T : ('AND' | 'and') '.' ;
fragment EQ_T : ('EQ' | 'eq' ) '.' ;
fragment NE_T : ('NE' | 'ne' ) '.' ;
fragment DIGIT : '0'..'9';
And if you feed the generated parser the input:
a.eq.b.and.c.ne.d
c.append(b)
the following output will be printed:
ID 'a'
EQ_T '.eq.'
ID 'b'
AND_T '.and.'
ID 'c'
NE_T '.ne.'
ID 'd'
ID 'c'
DOT_T '.'
ID 'append'
LPAREN_T '('
ID 'b'
RPAREN_T ')'
And for the input:
lda.eq.3.and.dim.eq.3
the following is printed:
ID 'lda'
EQ_T '.eq.'
NUMBER_T '3'
AND_T '.and.'
ID 'dim'
EQ_T '.eq.'
NUMBER_T '3'
EDIT
The fact that DECIMAL and KIND both start with '.' DIGIT+ is not good. Try something like this:
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL ((KIND)=> KIND)?)?
;
fragment DECIMAL : '.' DIGIT+;
fragment KIND : '_' (ALPHA+ | DIGIT+); // removed ('.' DIGIT+) from this fragment
Note that the rule NUMBER_T will now never produce DECIMAL or KIND tokens. If you want that to happen, you need to change the type:
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL {/*change type*/} ((KIND)=> KIND {/*change type*/})?)?
;