ANTLR: A lexer or a parser error?

I wrote a simple lexer in ANTLR, and the grammar for ID is something like this:
ID : (('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*|'_'('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*);
(No digits are allowed at the beginning.)
When I generated the code (in Java) and tested the input:
3a
I expected an error, but the input was recognized as "INT ID". How can I fix the grammar to make it report an error (with only lexer rules)?
Thanks for your attention

Note that your rule could be rewritten into:
ID
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' |'_')*
;
or with fragments (rules that won't produce tokens, but are only used by other lexer rules):
ID
: (Letter | '_') (Letter| Digit |'_')*
;
fragment Letter
: 'a'..'z'
| 'A'..'Z'
;
fragment Digit
: '0'..'9'
;
But if input like "3a" is recognized by your lexer and produces the tokens INT and ID, then you shouldn't change anything. A problem with such input would probably come up in your parser rule(s) because it is semantically incorrect.
If you really want to let the lexer handle this kind of stuff, you could do something like this:
INT
: Digit+ (Letter {/* throw an exception */})?
;
And if you want to allow INT literals to possibly end with an f or L, then you'd first have to inspect the contents of Letter, and if it's not "f" or "L", then you throw an exception.
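The difference between the two behaviours can be sketched outside ANTLR. This is a minimal Python simulation, not generated lexer code: the regexes are simplified stand-ins for the INT and ID rules, and `tokenize` with its `strict` flag is a hypothetical helper made up for illustration.

```python
import re
import string

# Simplified stand-ins for the lexer rules discussed above (assumptions).
INT_RE = re.compile(r"[0-9]+")
ID_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text, strict=False):
    """Split text into (type, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = INT_RE.match(text, pos)
        if m:
            end = m.end()
            # Strict mode mirrors `Digit+ (Letter {throw})?`:
            # a letter directly after the digits is reported as an error.
            if strict and end < len(text) and text[end] in string.ascii_letters:
                raise ValueError(
                    f"letter {text[end]!r} directly after INT {m.group()!r}"
                )
            tokens.append(("INT", m.group()))
            pos = end
            continue
        m = ID_RE.match(text, pos)
        if m:
            tokens.append(("ID", m.group()))
            pos = m.end()
            continue
        raise ValueError(f"unexpected character {text[pos]!r}")
    return tokens
```

With `strict=False`, the input "3a" yields an INT followed by an ID, which is what the original lexer does. With `strict=True`, the same input raises an error, while "3 a" is still accepted.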

Related

How do I force the parser to match a content as an ID rather than a token?

I have a grammar as the following (It's a partial view with only the relevant parts):
elem_course : INIT_ABSCISSA '=' expression;
expression
: ID
| INT_VALUE
| '(' expression ')'
| expression OPERATOR1 expression
| expression OPERATOR2 expression
;
OPERATOR1 : '*' | '/' ;
OPERATOR2 : '+' | '-' ;
fragment
WORD : LETTER (LETTER | NUM | '_' )*;
ID : WORD;
fragment
NUM : [0-9];
fragment
LETTER : [a-zA-Z];
BEACON_ANTENNA_TRAIN : 'BEACON_ANTENNA_TRAIN';
And, I would like to match the following line :
INIT_ABSCISSA = 40 + BEACON_ANTENNA_TRAIN
But since BEACON_ANTENNA_TRAIN is a lexer token, even though the rule states that I expect an ID, the parser matches the keyword token and raises the following error when parsing:
line 11:29 mismatched input 'BEACON_ANTENNA_TRAIN' expecting {'(', INT_VALUE, ID}
Is there a way to force the parser that it should match the content as an ID rather than a token?
(Quick note: It's nice to abbreviate content in questions, but it really helps if it is functioning, stand-alone content that demonstrates your issue)
In this case, I've had to add the following lexer rules to get this to generate, so I'm making some (probably legitimate) assumptions.
INT_VALUE: [\-+]? NUM+;
INIT_ABSCISSA: 'INIT_ABSCISSA';
WS: [ \t\r\n]+ -> skip;
I'm also going to have to assume that BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN'; appears before your ID rule. (As posted, your token stream is as follows and could not generate the error you show.)
[#0,0:12='INIT_ABSCISSA',<ID>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<ID>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
If I reorder the lexer rules like this:
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
I can get your error message.
The lexer looks at your input stream of characters and attempts to match all lexer rules. To choose the token type, ANTLR will:
select the rule that matches the longest sequence of input characters
if multiple lexer rules match the same sequence of input characters, use the rule that appears first (that's why I had to reorder the rules to get your error).
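The two selection rules above can be sketched in a few lines of Python. This is only a simulation under simplified assumptions: the regexes approximate the ID and BEACON_ANTENNA_TRAIN rules, and `next_token` is a hypothetical helper, not part of ANTLR.

```python
import re


def next_token(rules, text, pos):
    """Return (name, lexeme) for the longest match at pos.

    `rules` is an ordered list of (name, regex) pairs. The comparison is
    strictly "longer than", so on a tie the rule listed first is kept.
    """
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# Simplified stand-ins for the lexer rules in the question (assumptions).
ID = ("ID", r"[A-Za-z][A-Za-z0-9_]*")
KEYWORD = ("BEACON_ANTENNA_TRAIN", r"BEACON_ANTENNA_TRAIN")
```

With `[ID, KEYWORD]` the text BEACON_ANTENNA_TRAIN comes out as an ID, and with `[KEYWORD, ID]` it comes out as the keyword token, because both rules match the same 20 characters. A longer word such as BEACON_ANTENNA_TRAIN2 is an ID in either order, since the longest match wins before rule order is considered.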
With those assumptions, now to your question.
The short answer is "you can't". The Lexer processes input and determines token types before the parser is involved in any way. There is nothing you can do in parser rules to influence Token Type.
The parser, on the other hand starts with the start rule and then uses a recursive descent algorithm to attempt to match your token stream to parser rules.
You don't really give any idea what guides whether BEACON_ANTENNA_TRAIN should be a BEACON_ANTENNA_TRAIN or an ID, so I'll put an example together that assumes that it's an ID if it's on the right hand side (rhs) of the elem_course rule.
Then this grammar:
grammar IDG;
elem_course: INIT_ABSCISSA '=' rhs_expression;
rhs_expression
: id = (ID | BEACON_ANTENNA_TRAIN | INIT_ABSCISSA)
| INT_VALUE
| '(' rhs_expression ')'
| rhs_expression OPERATOR1 rhs_expression
| rhs_expression OPERATOR2 rhs_expression
;
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
produces this token stream and parse tree:
$ grun IDG elem_course -tokens -tree IDG.txt
[#0,0:12='INIT_ABSCISSA',<'INIT_ABSCISSA'>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<'BEACON_ANTENNA_TRAIN'>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
(elem_course INIT_ABSCISSA = (rhs_expression (rhs_expression 40) + (rhs_expression BEACON_ANTENNA_TRAIN)))
As a side note: depending on what drives your decision, you might be able to leverage lexer modes, but there's nothing in your example that leaves that impression.
This is the well-known keyword-as-identifier problem, and Mike Cargal gave you a working solution. I just want to add that the general approach for this problem is to add all keywords that should also be matched as an identifier to a parser id rule. To restrict which keywords are allowed in certain grammar positions, you can use multiple id rules. For example, the MySQL grammar uses this approach to a large extent to define keywords that can be used as an identifier in general, or only as a label, for role names, etc.
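The idea of several id rules, each admitting a different set of keywords, can be pictured as sets of token types. The sketch below is plain Python rather than grammar code, and the split between "identifier" and "label" keywords is made up purely for illustration.

```python
# Hypothetical keyword sets; a real grammar would list its own keywords.
IDENTIFIER_KEYWORDS = {"BEACON_ANTENNA_TRAIN", "INIT_ABSCISSA"}
LABEL_KEYWORDS = {"BEACON_ANTENNA_TRAIN"}


def is_identifier(token_type):
    """Plays the role of a parser rule like `identifier: ID | <keywords>`."""
    return token_type == "ID" or token_type in IDENTIFIER_KEYWORDS


def is_label(token_type):
    """A narrower rule that admits fewer keywords in label position."""
    return token_type == "ID" or token_type in LABEL_KEYWORDS
```

The lexer still produces the keyword token type. It is the parser-side rule that decides where that type is acceptable as a name.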

Not able to parse continuous string using antlr (without spaces)

I have to parse the following query using ANTLR:
sys_nameLIKEvalue
Here sys_name is a variable which contains lowercase letters and underscores.
LIKE is a fixed keyword.
value is a variable which can contain lowercase letters, uppercase letters, and digits.
Below are the grammar rules I am using:
expression : parameter 'LIKE' values EOF;
parameter : (ID);
ID : (LOWERCASE) (LOWERCASE | UNDERSCORE)* ;
values : (VALUE);
VALUE : (LOWERCASE | NUMBER | UPPERCASE)+ ;
LOWERCASE : 'a'..'z' ;
UPPERCASE : 'A'..'Z' ;
NUMBER : '0'..'9' ;
UNDERSCORE : '_' ;
Test Case 1
Input : sys_nameLIKEabc
error thrown : line 1:8 missing 'LIKE' at 'LIKEabc'
Test Case 2
Input : sysnameLIKEabc
error thrown : line 1:0 mismatched input 'sysnameLIKEabc' expecting ID
A literal token inside your parser rule will be translated into a plain lexer rule. So, your grammar really looks like this:
expression : parameter LIKE values EOF;
parameter : ID;
values : VALUE;
LIKE : 'LIKE';
ID : LOWERCASE (LOWERCASE | UNDERSCORE)* ;
VALUE : (LOWERCASE | NUMBER | UPPERCASE)+ ;
// Fragment rules will never become tokens of their own: good practice!
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;
fragment NUMBER : '0'..'9' ;
fragment UNDERSCORE : '_' ;
Since lexer rules are greedy, and if two or more lexer rules match the same number of characters the first will "win", your input is tokenized as follows:
Input: sys_nameLIKEabc, 2 tokens:
sys_name: ID
LIKEabc: VALUE
Input: sysnameLIKEabc, 1 token:
sysnameLIKEabc: VALUE
So, the token LIKE will never be created with your test input, and none of your parser rules will ever match. It also seems a bit odd to parse input without any delimiters, like spaces.
To fix your issue, you will either have to introduce delimiters, or disallow uppercase letters in your VALUE.
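To see the effect of the second option, here is a small Python simulation of the "longest match, then first rule" behaviour. It is a sketch under assumptions: the regexes approximate the lexer rules, `tokenize`, `ORIGINAL` and `NO_UPPERCASE` are hypothetical names, and the real ANTLR lexer is not involved.

```python
import re


def tokenize(rules, text):
    """Tokenize text: longest match wins, the first listed rule wins ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


# Rules as posted: VALUE may contain uppercase letters.
ORIGINAL = [("LIKE", r"LIKE"), ("ID", r"[a-z][a-z_]*"), ("VALUE", r"[a-zA-Z0-9]+")]
# Assumed variant of the fix: VALUE without uppercase letters.
NO_UPPERCASE = [("LIKE", r"LIKE"), ("ID", r"[a-z][a-z_]*"), ("VALUE", r"[a-z0-9]+")]
```

In this simulation, `tokenize(NO_UPPERCASE, "sys_nameLIKEabc")` produces ID, LIKE, ID: the LIKE token now appears, but a purely lowercase value such as abc ties between ID and VALUE and goes to ID because it is listed first. So the values parser rule would probably also have to accept ID, or the rule order would have to change.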

How to allow an identifier which can start with a digit without causing MismatchedTokenException

I want to match the following input:
statement span=1m 0_dur=12
with the following grammar:
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
statement :'statement' 'span' '=' INTEGER 'm' ident '=' INTEGER;
INTEGER
: DIGIT+
;
ident : IDENT | 'AVG' | 'COUNT';
IDENT
: (LETTER | DIGIT | '_')+ ;
WHITESPACE
: ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment
LETTER : ('a'..'z' | 'A'..'Z') ;
fragment
DIGIT : '0'..'9';
but it causes an error:
MismatchedTokenException : line 1:15 mismatched input '1m' expecting '\u0004'
Does anyone have any idea how to solve this?
Thanks
Charles
I think your grammar is context sensitive, even at the lexical analyser (tokenizer) level. The string "1m" is recognized as IDENT, not as INTEGER followed by 'm'. You can either redefine your syntax, use predicated parsing, or embed Java code in your grammar to detect the context (e.g. if the number appears after "span" followed by "=", then parse it as INTEGER).
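The context idea can be sketched in Python. This is only an illustration of the approach, not ANTLR code: the regexes approximate the lexer rules, `tokenize` and its `context_aware` flag are hypothetical, and "only INTEGER may match right after span =" is an assumed context rule.

```python
import re

# Literals from the parser rules; assumed to take priority over IDENT.
KEYWORDS = {"statement", "span", "m", "AVG", "COUNT"}
# Order matters for ties: INTEGER is listed before IDENT, as in the grammar.
RULES = [("EQ", r"="), ("INTEGER", r"[0-9]+"), ("IDENT", r"[A-Za-z0-9_]+")]


def tokenize(text, context_aware):
    """Longest-match tokenizer with an optional `span =` context check."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Assumed context rule: right after `span =`, only INTEGER may match.
        forced = context_aware and [lex for _, lex in tokens[-2:]] == ["span", "="]
        best = None
        for name, pattern in RULES:
            if forced and name != "INTEGER":
                continue
            m = re.compile(pattern).match(text, pos)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, lexeme = best
        if name == "IDENT" and lexeme in KEYWORDS:
            name = "KEYWORD"
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens
```

Without the context check, "1m" becomes a single IDENT because it is the longest match. With it, the lexer emits INTEGER 1 and then the keyword m, while 0_dur is still an IDENT.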

ANTLR - Semantic predicate and LL(1)

I want to make an LL(1) grammar in ANTLR that allows multiple assignment, like:
x = y = 5;
I think semantic predicates are useful in this situation, but the following rules won't work :(
tokens {
BECOMES = '='
}
assignment_statement
: IDENTIFIER BECOMES expr
;
expr
: (IDENTIFIER BECOMES)=> IDENTIFIER BECOMES expr
| expr_or
;
IDENTIFIER
: LETTER (LETTER | DIGIT)*
;
ANTLRWORKS gives a NoViableAltException.
Do you know what I did wrong and how to make this work?
Thank you!
A grammar with a syntactic (not semantic) predicate that looks ahead 2 tokens isn't LL(1), of course.
But, you don't need a predicate, simply do something like this:
grammar T;
options {
output=AST;
}
tokens {
BECOMES = '=';
}
assignment_statement
: (IDENTIFIER BECOMES)+ expr ';'
;
expr
: IDENTIFIER
| NUMBER
;
IDENTIFIER
: LETTER (LETTER | DIGIT)*
;
NUMBER
: DIGIT+
;
fragment LETTER : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
which would parse the input "x=y=5;" (parse tree image omitted), but would reject input like "x=2=3;".
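The shape of the rule `(IDENTIFIER BECOMES)+ expr ';'` can be checked with a small hand-written recognizer. This is a Python sketch, not what ANTLR generates: `tokenize` and `parse_assignment` are hypothetical names, whitespace skipping is an added assumption (the grammar above has no whitespace rule), and the loop peeks at two tokens to decide whether another `IDENTIFIER =` pair follows.

```python
import re

# One named group per token type; leading whitespace is skipped (assumption).
TOKEN_RE = re.compile(
    r"\s*(?:(?P<IDENTIFIER>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<NUMBER>[0-9]+)"
    r"|(?P<BECOMES>=)"
    r"|(?P<SEMI>;))"
)


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse_assignment(text):
    """Return (targets, value) for `(IDENTIFIER BECOMES)+ expr ';'`."""
    tokens = tokenize(text)
    types = [t for t, _ in tokens]
    i, targets = 0, []
    # (IDENTIFIER BECOMES)+ : consume pairs while an identifier precedes '='.
    while types[i:i + 2] == ["IDENTIFIER", "BECOMES"]:
        targets.append(tokens[i][1])
        i += 2
    if not targets:
        raise ValueError("expected at least one 'IDENTIFIER ='")
    # expr : IDENTIFIER | NUMBER
    if i >= len(tokens) or types[i] not in ("IDENTIFIER", "NUMBER"):
        raise ValueError("expected an expression")
    value = tokens[i][1]
    i += 1
    if types[i:] != ["SEMI"]:
        raise ValueError("expected ';' at end of statement")
    return targets, value
```

Here `parse_assignment("x=y=5;")` returns the targets x and y with the value 5, while `parse_assignment("x=2=3;")` raises an error because 2 cannot be an assignment target.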
Also, ANTLRWorks' interpreter doesn't work with any kind of predicate: use ANTLRWorks' debugger instead.

Parse sentences with different word types

I'm looking for a grammar for analyzing two types of sentences, that
means words separated by white spaces:
ID1: sentences with words not beginning with numbers
ID2: sentences with words not beginning with numbers and numbers
Basically, the structure of the grammar should look like
ID1 separator ID2
ID1: Word can contain number like Var1234 but not start with a number
ID2: Same as above but 1234 is allowed
separator: e. g. '='
@Bart
I just tried to add two tokens '_' and '"' as lexer rule Special for later use in lexer rule Word.
Even though I haven't used Special in the following grammar, I get the following error in ANTLRWorks 1.4.2:
The following token definitions can never be matched because prior tokens match the same input: Special
But when I add fragment before Special, I don't get that error. Why?
grammar Sentence1b1;
tokens
{
TCUnderscore = '_' ;
TCQuote = '"' ;
}
assignment
: id1 '=' id2
;
id1
: Word+
;
id2
: ( Word | Int )+
;
Int
: Digit+
;
// A word must start with a letter
Word
: ( 'a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit )*
;
Special
: ( TCUnderscore | TCQuote )
;
Space
: ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
;
fragment Digit
: '0'..'9'
;
Lexer rule Special shall then be used in lexer rule Word:
Word
: ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
;
I'd go for something like this:
grammar Sentence;
assignment
: id1 '=' id2
;
id1
: Word+
;
id2
: (Word | Int)+
;
Int
: Digit+
;
// A word must start with a letter
Word
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit)*
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
fragment Digit
: '0'..'9'
;
which will parse the input:
Word can contain number like Var1234 but not start with a number = Same as above but 1234 is allowed
as follows (parse tree image omitted).
EDIT
To keep lexer rules nicely packed together, I'd keep them all at the bottom of the grammar instead of partly in the tokens { ... } block, which I only use for defining "imaginary tokens" (used in AST creation):
// wrong!
Special : (TCUnderscore | TCQuote);
TCUnderscore : '_';
TCQuote : '"';
Now, with the rules above, TCUnderscore and TCQuote can never become a token because when the lexer stumbles upon a _ or ", a Special token is created. Or in this case:
// wrong!
TCUnderscore : '_';
TCQuote : '"';
Special : (TCUnderscore | TCQuote);
the Special token can never be created because the lexer would first create TCUnderscore and TCQuote tokens. Hence the error:
The following token definitions can never be matched because prior tokens match the same input: ...
If you make TCUnderscore and TCQuote a fragment rule, you don't have that problem because fragment rules only "serve" other lexer rules. So this works:
// good!
Special : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote : '"';
Also, fragment rules can therefore never be "visible" in any of your parser rules (the lexer will never create a TCUnderscore or TCQuote token!).
// wrong!
parse : TCUnderscore;
Special : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote : '"';
I'm not sure if that fits your needs, but with Bart's help in my post
ANTLR - identifier with whitespace
I came to this grammar:
grammar PropertyAssignment;
assignment
: id_nodigitstart '=' id_digitstart EOF
;
id_nodigitstart
: ID_NODIGITSTART+
;
id_digitstart
: (ID_DIGITSTART|ID_NODIGITSTART)+
;
ID_NODIGITSTART
: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*
;
ID_DIGITSTART
: ('0'..'9'|'a'..'z'|'A'..'Z')+
;
WS : (' ')+ {skip();}
;
"a name = my 4value" works while "4a name = my 4value" causes an exception.