Simple Antlr3 Token parsing - antlr

while i'm somewhat comforted by the amount of questions regarding Antlr grammar (it's not just me trying to shave this yak shaped thing), i haven't found a question/answer that comes close to helping with my issue.
I'm using Antlr3.3 with a mixed Token/Parser lexer.
I'm using gUnit to help prove the grammar, and some jUnit tests; this is where the fun begins.
I have a simple config file i want to parse:
identifier foobar {
port=8080
stub plusone.google.com {
status-code = 206
header = []
body = []
}
}
I'm having trouble parsing the "identifier" (foobar in this example):
Valid names i want to allow are:
foobar
foo-bar
foo_bar
foobar2
foo-bar2
foo_bar2
3foobar
_foo-bar3
and so on, therefore a valid name can use the characters 'a..z'|'A..Z', '0..9' '_' and '-'
The grammar i've arrived at is this (note this isnt the full grammar, just the portion pertinent to this question):
fragment HYPHEN : '-' ;
fragment UNDERSCORE : '_' ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' |'A'..'Z' ;
fragment NUMBER : DIGIT+ ;
fragment WORD : LETTER+ ;
IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;
and the corresponding gUnit test
IDENTIFIER:
"foobar" OK
"foo_bar" OK
"foo-bar" OK
"foobar1" OK
"foobar12" OK
"foo-bar2" OK
"foo_bar2" OK
"foo-bar-2" OK
"foo-bar_2" OK
"5foobar" OK
"f_2-a" OK
"aA0_" OK
// no "funny chars"
"foo#bar" FAIL
// not with whitepsace
"foo bar" FAIL
Running the gUnit tests only fails for "5foobar". I've managed to parse the difficult stuff, and yet the seemingly simple task of parsing an identifier has beaten me.
Can anyone point me to where i'm going wrong? How can i match without being greedy?
Many thanks in advance.
-- UPDATE --
I changed the grammar as per Barts answer, to this:
IDENTIFIER : ('0'..'9'| 'a'..'z'|'A'..'Z' | '_'|'-') ('_'|'-'|'a'..'z'|'A'..'Z'|'0'..'9')* ;
and this fixed the failing gUnit tests, but broke an unreleated jUnit test, that tests the "port" parameter.
The following grammar deals with the "port=8080" element of the config snippet above:
configurationStatement[MiddlemanConfiguration config]
: PORT EQ port=NUMBER {
config.setConfigurationPort(Integer.parseInt(port.getText())); }
| def=proxyDefinition { config.add(def); }
;
The message i get is:
mismatched input '8080' expecting NUMBER
Where NUMBER is defined as NUMBER : ('0'..'9')+ ;
Moving the rule for NUMBER above the IDENTIFIER block, fixed this issue.

IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;
is equivalent to:
IDENTIFIER
: DIGIT
| LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
;
So, an IDENTIFIER is eiter a single DIGIT, or starts with a LETTER followed by (LETTER | DIGIT | HYPHEN | UNDERSCORE)*.
You probably meant:
IDENTIFIER
: (DIGIT | LETTER | UNDERSCORE) (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
;
However, that also allows for 3---3 as being a valid IDENTIFIER, is that correct?

Related

How do I force the the parser to match a content as an ID rather than a token?

I have a grammar as the following (It's a partial view with only the relevant parts):
elem_course : INIT_ABSCISSA '=' expression;
expression
: ID
| INT_VALUE
| '(' expression ')'
| expression OPERATOR1 expression
| expression OPERATOR2 expression
;
OPERATOR1 : '*' | '/' ;
OPERATOR2 : '+' | '-' ;
fragment
WORD : LETTER (LETTER | NUM | '_' )*;
ID : WORD;
fragment
NUM : [0-9];
fragment
LETTER : [a-zA-Z];
BEACON_ANTENNA_TRAIN : 'BEACON_ANTENNA_TRAIN';
And, I would like to match the following line :
INIT_ABSCISSA = 40 + BEACON_ANTENNA_TRAIN
But as BEACON_ANTENNA_TRAIN is a lexer token and even the rule states that I except and ID, the parser matchs the token and raise me the following error when parsing:
line 11:29 mismatched input 'BEACON_ANTENNA_TRAIN' expecting {'(', INT_VALUE, ID}
Is there a way to force the parser that it should match the content as an ID rather than a token?
(Quick note: It's nice to abbreviate content in questions, but it really helps if it is functioning, stand-alone content that demonstrates your issue)
In this case, I've had to add the following lever rules to get this to generate, so I'm making some (probably legitimate) assumptions.
INT_VALUE: [\-+]? NUM+;
INIT_ABSCISSA: 'INIT_ABSCISSA';
WS: [ \t\r\n]+ -> skip;
I'm also going to have to assume that BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN'; appears before your ID rule. As posted your token stream is as follows and could not generate the error you show)
[#0,0:12='INIT_ABSCISSA',<ID>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<ID>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
If I reorder the lexer rules like this:
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
I can get your error message.
The lexer looks at you input stream of characters and attempts to match all lexer rules. To choose the token type, ANTLR will:
select the rule that matches the longest stream of input characters
If multiple Lever rules match the same sequence of input characters, then the rule that appears first will be used (that's why I had to re-order the rules to get your error.
With those assumptions, now to your question.
The short answer is "you can't". The Lexer processes input and determines token types before the parser is involved in any way. There is nothing you can do in parser rules to influence Token Type.
The parser, on the other hand starts with the start rule and then uses a recursive descent algorithm to attempt to match your token stream to parser rules.
You don't really give any idea what really guides whether BEACON_ANTENNA_TRAIN should be a BEACON_ANTENNA_TRAIN or an ID, so I'll put an example together that assumes that it's an ID if it's on the right hand side (rhs) of the elemen_course rule.
Then this grammar:
grammar IDG
;
elem_course: INIT_ABSCISSA '=' rhs_expression;
rhs_expression
: id = (ID | BEACON_ANTENNA_TRAIN | INIT_ABSCISSA)
| INT_VALUE
| '(' rhs_expression ')'
| rhs_expression OPERATOR1 rhs_expression
| rhs_expression OPERATOR2 rhs_expression
;
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
produces this token stream and parse tree:
$ grun IDG elem_course -tokens -tree IDG.txt
[#0,0:12='INIT_ABSCISSA',<'INIT_ABSCISSA'>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<'BEACON_ANTENNA_TRAIN'>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
(elem_course INIT_ABSCISSA = (rhs_expression (rhs_expression 40) + (rhs_expression BEACON_ANTENNA_TRAIN)))
As a side note: It's possible that, depending on what drives your decision, you might be able to leverage Lexer modes, but there's not anything in your example to leaves that impression.
This is the well known keyword-as-identifier problem and Mike Cargal gave you a working solution. I just want to add that the general approach for this problem is to add all keywords to a parser id rule that should be matched as an id. To restrict which keyword is allowed in certain grammar positions, you can use multiple id rules. For example the MySQL grammar uses this approach to a large extend to define keywords that can go as identifier in general or only as a label, for role names etc.

ANTLR: "for" keyword used for loops conflicts with "for" used in messages

I have the following grammar:
myg : line+ EOF ;
line : ( for_loop | command params ) NEWLINE;
for_loop : FOR WORD INT DO NEWLINE stmt_body;
stmt_body: line+ END;
params : ( param | WHITESPACE)*;
param : WORD | INT;
command : WORD;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
WORD : (LOWERCASE | UPPERCASE | DIGIT | [_."'/\\-])+ (DIGIT)* ;
INT : DIGIT+ ;
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
FOR: 'for';
DO: 'do';
END: 'end';
My problem is that the 2 following are valid in this language:
message please wait for 90 seconds
This would be a valid command printing a message with the word "for".
for n 2 do
This would be the beginning of a for loop.
The problem is that with the current lexer it doesn't match the for loop since 'for' is matched by the WORD rule as it appears first.
I could solve that by putting the FOR rule before the WORD rule but then 'for' in message would be matched by the FOR rule
This is the typical keywords versus identifier problem and I thought there were quite a number of questions regarding that here on Stackoverflow. But to my surprise I can only find an old answer of mine for ANTLR3.
Even though the principle mentioned there remains the same, you no longer can change the returned token type in a parser rule, with ANTLR4.
There are 2 steps required to make your scenario work.
Define the keywords before the WORD rule. This way they get own token types you need for grammar parts which require specific keywords.
Add keywords selectively to rules, which parse names, where you want to allow those keywords too.
For the second step modify your rules:
param: WORD | INT | commandKeyword;
command: WORD | commandKeyword;
commandKeyword: FOR | DO | END; // Keywords allowed as names in commands.

Not able to parse continuos string using antlr (without spaces)

I have to parse the following query using antlr
sys_nameLIKEvalue
Here sys_name is a variable which has lower case and underscores.
LIKE is a fixed key word.
value is a variable which can contain lower case uppercase as well as number.
Below the grammer rule i am using
**expression : parameter 'LIKE' values EOF;
parameter : (ID);
ID : (LOWERCASE) (LOWERCASE | UNDERSCORE)* ;
values : (VALUE);
VALUE : (LOWERCASE | NUMBER | UPPERCASE)+ ;
LOWERCASE : 'a'..'z' ;
UPPERCASE : 'A'..'Z' ;
NUMBER : '0'..'9' ;
UNDERSCORE : '_' ;**
Test Case 1
Input : sys_nameLIKEabc
error thrown : line 1:8 missing 'LIKE' at 'LIKEabc'
Test Case 2
Input : sysnameLIKEabc
error thrown : line 1:0 mismatched input 'sysnameLIKEabc' expecting ID
A literal token inside your parser rule will be translated into a plain lexer rule. So, your grammar really looks like this:
expression : parameter LIKE values EOF;
parameter : ID;
values : VALUE;
LIKE : 'LIKE';
ID : LOWERCASE (LOWERCASE | UNDERSCORE)* ;
VALUE : (LOWERCASE | NUMBER | UPPERCASE)+ ;
// Fragment rules will never become tokens of their own: good practice!
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;
fragment NUMBER : '0'..'9' ;
fragment UNDERSCORE : '_' ;
Since lexer rules are greedy, and if two or more lexer rules match the same amount of character the first will "win", your input is tokenized as follows:
Input: sys_nameLIKEabc, 2 tokens:
sys_name: ID
LIKEabc: VALUE
Input: sysnameLIKEabc, 1 token:
sys_nameLIKEabc: VALUE
So, the token LIKE will never be created with your test input, so none of your parser rule will ever match. It also seems a bit odd to parse input without any delimiters, like spaces.
To fix your issue, you will either have to introduce delimiters, or disallow your VALUE to contain uppercases.

ANTLR AST Grammar Issue Mismatched Token Exception

my real grammar is way more complex but I could strip down my problem. So this is the grammar:
grammar test2;
options {language=CSharp3;}
#parser::namespace { Test.Parser }
#lexer::namespace { Test.Parser }
start : 'VERSION' INT INT project;
project : START 'project' NAME TEXT END 'project';
START: '/begin';
END: '/end';
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
INT : '0'..'9'+;
NAME: ('a'..'z' | 'A'..'Z')+;
TEXT : '"' ( '\\' (.) |'"''"' |~( '\\' | '"' | '\n' | '\r' ) )* '"';
STARTA
: '/begin hello';
And I want to parse this (for example):
VERSION 1 1
/begin project
testproject "description goes here"
/end
project
Now it will not work like this (Mismatched token exception). If I remove the last Token STARTA, it works. But why? I don't get it.
Help is really appreciated.
Thanks.
When the lexer sees the input "/begin " (including the space!), it is committed to the rule STARTA. When it can't match said rule, because the next char in the input is a "p" (from "project") and not a "h" (from "hello"), it will try to match another rule that can match "/begin " (including the space!). But there is no such rule, producing the error:
mismatched character 'p' expecting 'h'
and the lexer will not give up the space and match the START rule.
Remember that last part: once the lexer has matched something, it will not give up on it. It might try other rules that match the same input, but it will not backtrack to match a rule that matches less characters!
This is simply how the lexer works in ANTLR 3.x, no way around it.

ANTLR : A lexer or a parser error?

I wrote a simple lexer in ANTLR and the grammer for ID is something like this :
ID : (('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*|'_'('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*);
(No digits are allowed at the beginning)
when I generated the code (in java) and tested the input :
3a
I expected an error but the input was recognized as "INT ID" , how can i fix the grammer to make it report an error ?(with only lexer rules)
Thanks for your attention
Note that your rule could be rewritten into:
ID
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' |'_')*
;
or with fragments (rules that won't produce tokens, but are only used by other lexer rules):
ID
: (Letter | '_') (Letter| Digit |'_')*
;
fragment Letter
: 'a'..'z'
| 'A'..'Z'
;
fragment Digit
: '0'..'9'
;
But if input like "3a" is recognized by your lexer and produces the tokens INT and ID, then you shouldn't change anything. A problem with such input would probably come up in your parser rule(s) because it is semantically incorrect.
If you really want to let the lexer handle this kind of stuff, you could do something like this:
INT
: Digit+ (Letter {/* throw an exception */})?
;
And if you want to allow INT literals to possibly end with a f or L, then you'd first have to inspect the contents of Letter and if it's not "f" or "L", the you throw an exception.